This dissertation is the result of work done in conjunction with a research project
concerning  the  development  of  an  image  algebra,  an  algebraic  structure  for  digital  image 
processing.  A  description  of  the  use  of  the  image  algebra  as  a  model  and  tool  for  the 
development  of  local,  parallel  algorithms  is  presented.  Specifically,  the  image  algebra  is 
used  as  a  model  for  local  computation  of  linear  transformations,  and  algorithms  for  local 
computation  of  the  discrete  Fourier  transform  are  developed  using  traditional  fast 
transform  methods  as  well  as  numerical  techniques. 

The  basic  operators  and  operands  of  an  image  algebra  are  defined.  Relationships 
between  the  image  algebra  and  matrix  algebra  are  described.  Networks  of  processors  are 
modeled  as  directed  graphs.    Images  are  represented  as  functions  defined  on  the  nodes  of 
a directed graph. It is shown that every linear image to image transform can be factored
into a product of linear transformations which are compatible with the network if and
only if the directed graph is strongly connected. An algebraic model of linear computations
on Cayley networks is constructed. It is shown that the set of linear transformations
which are translation invariant with respect to a Cayley network constructed from
a finite group, G, is isomorphic to the group algebra of G over the complex numbers.
Parallel  algorithms  for  implementing  discrete  Fourier  transforms  of  arbitrary  size  on 
mesh-connected  arrays  are  derived  using  two  different  methods.  One  method  is  based  on 
matrix  algebra  associated  with  fast  Fourier  transforms.  The  other  method  is  based  on  a 
numerical  technique  for  factoring  square  matrices  into  products  of  tridiagonal  matrices. 

INTRODUCTION

This  dissertation  is  the  result  of  work  done  in  conjunction  with  a  research  project 
concerning  the  development  of  an  image  algebra,  an  algebraic  structure  for  use  in  digital 
image  processing.  A  description  of  the  use  of  image  algebra  as  a  model  and  tool  for  the 
development  of  local,  parallel  algorithms  is  presented.  Specifically,  we  use  image  algebra 
as  a  model  for  local  computation  of  all  linear  transforms  and  we  develop  algorithms  for 
the  local  computation  of  the  discrete  Fourier  transform  (DFT). 

An  image  algebra  is  a  heterogeneous  algebraic  structure  whose  operands  are  those 
objects  commonly  encountered  in  digital  image  processing  and  whose  operators  reflect 
the  types  of  transformations  commonly  used  in  digital  image  processing  [5].  There  are 
two major reasons for developing an image algebra: an image algebra can provide a standard
mathematical structure for expressing and investigating problems in digital image
processing and can serve as the basis for an algebraically based, high level programming
language. It is natural to use an algebraic structure as the basis for a programming
language. There is a growing awareness, however, and not only in the image processing
community, that traditional algebraic structures do not accurately reflect the
representation and manipulation of operands as performed by digital computers. Thus,
research has been initiated that focuses on developing new algebraic systems that accurately
model the "environment" of the digital computer [7,13,33,45,50].

Upon  reflecting  on  the  history  of  mathematics,  one  sees  that  many  of  the  advances 
and increases in understanding of mathematics and its applications were not possible
without the development of good notational systems. Thus, the possibility that this
type  of  research  will  result  in  increased  understanding  of  the  mathematical  problems  in 
digital  image  processing  is  supported  by  historical  precedent. 

Background  of  Image  Algebra 

The use of image algebra in digital image processing was initiated, apparently
independently, by Serra [52], Miller [32, 33], and Sternberg [55, 56, 58]. Ritter showed
that the algebras used by these researchers were all equivalent and based on the operations
of Minkowski addition and subtraction of sets in $R^n$ [44]. These operations are
defined by

\[ A + B = \{\, a + b : a \in A,\ b \in B \,\} \]

and

\[ A - B = \overline{\overline{A} + B} \]

where A and B are subsets of $R^n$ and the bar represents set complementation [2]. The
Minkowski operations are sometimes referred to as the opening and closing or erosion
and dilation operations.
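The two Minkowski operations can be illustrated with a short sketch. It assumes finite point sets in $Z^2$ rather than general subsets of $R^n$, and it computes the subtraction as the intersection of the translates A + b, which agrees with the complement form $\overline{\overline{A} + B}$ because a point x lies outside $\overline{A} + B$ exactly when x - b is in A for every b in B. The function names are illustrative only and do not come from the dissertation.

```python
def minkowski_add(A, B):
    """Minkowski addition: all sums a + b with a in A and b in B."""
    return {(a[0] + b[0], a[1] + b[1]) for a in A for b in B}


def minkowski_sub(A, B):
    """Minkowski subtraction, computed as the intersection of the translates A + b.

    A point x belongs to every translate A + b exactly when x - b is in A
    for all b in B.
    """
    if not B:
        raise ValueError("B must be non-empty")
    candidates = minkowski_add(A, B)  # every point of the result lies in here
    return {x for x in candidates
            if all((x[0] - b[0], x[1] - b[1]) in A for b in B)}


# A 3 x 3 square of grid points and a two-point set.
square = {(i, j) for i in range(3) for j in range(3)}
pair = {(0, 0), (1, 0)}

dilated = minkowski_add(square, pair)  # the square grown by one column
eroded = minkowski_sub(square, pair)   # the square shrunk by one column
```

In this example the addition enlarges the square to a 4 x 3 block, while the subtraction keeps only those points x for which both x and x - (1, 0) lie in the square.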

A  surprising  number  of  techniques  for  extracting  information  from  digital  images 
can  be  developed  using  these  operations.  Many  of  these  techniques  can  be  found  in  the 
references already cited. Indeed, Sternberg designed and built a special purpose computer,
called the Cytocomputer, to implement the Minkowski operations [30, 57].

The  Minkowski  operations  are  limited  in  scope  to  such  a  degree,  however,  that  they 
cannot  be  used  as  the  basis  of  a  general  purpose  algebraically  based  language  for  digital 
image  processing.  Many  common  image  processing  techniques  cannot  be  expressed  using 
them [33]. Aware of a desire within the image processing community for an algebraic
structure that was capable of expressing most, if not all, of the common image processing
algorithms, Ritter began the development of a more general image algebra
[7, 44, 45, 46]. The research in this dissertation grew out of this effort to develop such
an  algebraic  structure. 

To ensure that the image algebra developed would be adequate for the intended uses, the
following goals (among others) were set:

1.)  Define  a  complete  algebra,  that  is,  one  that  is  capable  of  expressing  most,  if  not 
all,  of  the  techniques  being  used  in  digital  image  processing. 

2.) Define a simple algebra. The notation should be easy to read and understand by
programmers who may not have an extensive mathematical background. There should
be a small number of operators and they should reflect the structure of the computations.

3.) Develop basic algebraic properties and relations within the algebra.

4.) Determine relationships between the image algebra and existing algebraic structures.

5.)  Demonstrate  the  applicability  of  the  image  algebra  to  the  area  of  parallel  image 
processing. 

In  this  dissertation,  we  mainly  address  goals  4.)  and  5.).  We  remark  that  the 
development  of  an  image  algebra  is  an  ongoing  process  at  this  time. 

We  now  present  some  background  information  about  the  area  of  parallel  computer 
architectures  in  image  processing. 


Parallel  Image  Processing 

Computer  processing  of  digital  images  requires  enormous  amounts  of  computation. 
Moreover,  in  many  cases  the  computations  are  "local"  and  translation  invariant.  Thus, 
the  use  of  parallel  computers  in  this  area  is  highly  desirable  [13]. 

A digital image can be thought of as a function defined on the discrete rectangle

\[ X = \{\, (i,j) : 0 \le i \le m-1,\ 0 \le j \le n-1 \,\}. \]

An image to image transformation is a mapping with domain and range contained in the
set of all such functions. For many useful image to image transformations, it happens
that the new value at a point $x \in X$ depends only on the values in a neighborhood of the
point. Typical neighborhoods are the von Neumann and Moore neighborhoods depicted
in Figures 1 and 2.

Figure 1. The von Neumann neighborhood.

Figure 2. The Moore neighborhood.
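As a rough illustration of the two neighborhoods on the discrete rectangle X, the following sketch lists the points belonging to each. It assumes that a neighborhood contains its center point and is truncated at the border of the m x n rectangle; both conventions, and the function names, are assumptions made here for illustration and are not specified in the text.

```python
def von_neumann(point, m, n):
    """Center point plus its horizontal and vertical neighbors inside X."""
    i, j = point
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    return {(i + di, j + dj) for di, dj in offsets
            if 0 <= i + di < m and 0 <= j + dj < n}


def moore(point, m, n):
    """Center point plus all eight surrounding points inside X."""
    i, j = point
    return {(i + di, j + dj)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if 0 <= i + di < m and 0 <= j + dj < n}


interior_vn = von_neumann((2, 2), 5, 5)  # 5 points in the interior
interior_mo = moore((2, 2), 5, 5)        # 9 points in the interior
corner_vn = von_neumann((0, 0), 5, 5)    # truncated at a corner
corner_mo = moore((0, 0), 5, 5)          # truncated at a corner
```

A transformation is local in the sense used here when the new value at a point can be computed from the image values at the points returned by one of these functions.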


Due to advances in VLSI (very large scale integration) technology, it has
become feasible to build large rectangular arrays of simple processors. The processors in
these arrays each have their own small memory and can access the memory of the processors
in their neighborhoods, typically von Neumann or Moore neighborhoods. Thus,
these  processors  can  compute  functions  of  the  values  in  their  neighborhoods.  An  image 
to  image  transformation  of  this  type  is  called  a  local  transformation.  Some  of  the  arrays 
that  have  been  built  according  to  this  design  are  the  Massively  Parallel  Processor 
(MPP)  [13,  41,  60],  the  Distributed  Array  Processor  (ICL  DAP)  [20,  36],  the  Geometric 
Arithmetic  Parallel  Processor  (GAPP)  [8,  54],  and  the  CLIP4  [12,  13].  These  arrays  of 
processors are generally referred to as massively parallel processors, cellular array processors,
or mesh-connected arrays. The term cellular array is due to the fact that the design
of  mesh-connected  arrays  is  based  on  the  concept  of  a  cellular  automaton  [35,  65]. 

There are other types of parallel computer architectures in use in digital image processing.
The Cytocomputer, mentioned previously, is an example of a pipeline computer.
The image is fed through a pipeline of processors, each processor being connected to a
certain number of the others (its neighbors) and computing some function of the values
in its neighborhood. A computer of this type implements translation invariant transformations;
that is, the same transformation is applied to each point of the image.

A  systolic  array  is  a  system  of  processors  designed  to  perform  a  specific  function  or 
class of functions, such as matrix multiplication or convolutions. The simplest description
is that data are fed into the array in a specified format, "flow" through the system,
and  the  answer  comes  out.  However,  there  are  varying  degrees  of  flexibility  in  systolic 
arrays  [14]. 

Many other architectures have been proposed for image processing, some different
from  those  described  above  and  some  variations  of  the  previously  described  architectures 
[13,  49,  64].  A  key  feature  of  most  of  these  architectures  is  that  they  have  a  large 
number  of  processors  that  communicate  directly  with  only  a  small  subset  of  the  others. 
Many  of  them  compute  only  translation  invariant  transformations  and  some  of  them 
have  directional  flows  associated  with  the  data. 

There  is  a  desire  within  the  image  processing  community  to  use  computers  with 
parallel  architectures  as  general  purpose  image  processing  computers.  The  heavy  use  of 
local  transformations  and  the  large  computational  burden  make  such  architectures 
attractive.  However,  a  good  deal  of  research  needs  to  be  done  to  develop  methods  for 
using  these  architectures  efficiently  [43].  In  particular,  it  has  been  pointed  out  by 
Schwartz  that  the  lack  of  efficient  methods  for  computing  global  linear  transforms  is  an 
obstacle  to  the  goal  of  developing  real-time  algorithms  for  scene  analysis  by  robots  [51]. 

Summary and Background of Results Obtained

We  outline  the  research  described  in  this  dissertation  and  the  results  that  we  have 
obtained.  We  also  present  some  background  information  related  to  the  specific  results. 

The  major  results  of  this  dissertation  are  as  follows: 

1.)  The  establishment  of  necessary  and  sufficient  conditions  on  the  configuration 
structure  of  a  network  for  the  existence  of  local  decompositions  of  all  linear  transforms. 

2.) The definition and generalization of circulant templates. Establishment of relationships
between circulant templates and other algebraic objects and application of
these relationships to parallel computation of convolutions. We show that the set of all
templates that are translation invariant with respect to a Cayley network constructed
using a group G is isomorphic to the group algebra of G over the complex numbers. We
establish an equivalence between the invertibility of these templates and the invertibility
of the discrete Radon transform on finite groups.

3.)  The  development  of  local  decompositions  of  discrete  Fourier  transforms  using 
matrix  algebra  associated  with  fast  Fourier  transforms  (FFTs). 

4.) Derivation of a numerical technique for factoring square matrices into products
of tridiagonal matrices. Establishment of necessary and sufficient conditions for the technique
to be successful. Development of local decompositions of discrete Fourier
transforms using the technique.

5.)  Implementation  of  parallel  algorithms  for  computing  DFTs  locally  using  an 
image  algebra  preprocessor. 

In  Chapter  1,  we  define  an  image  algebra  and  describe  relationships  between  image 
algebra  and  linear  algebra.  In  Chapter  2,  we  develop  necessary  and  sufficient  conditions 
on  the  configuration  structure  of  networks  of  processors  to  ensure  that  every  linear 
transform has a local decomposition with respect to the array. In Chapter 3, we define
translation invariant and circulant templates and develop basic properties and relationships
related to them. We generalize these notions to include the notion of templates
that are translation invariant with respect to Cayley networks. In Chapter 4, we use
matrix  algebra  associated  with  FFTs  to  develop  local  decompositions  of  discrete  Fourier 
transforms  of  arbitrary  size.  These  decompositions  yield  algorithms  for  implementing 
FFTs  locally  with  respect  to  mesh-connected  arrays.  We  provide  a  table  which  shows 
the  estimated  number  of  arithmetic  and  data  shuffling  steps  required  to  implement  these 
algorithms  in  certain  cases.  In  Chapter  5,  we  use  numerical  linear  algebra  to  derive  an 
algorithm for computing tridiagonal decompositions of Fourier transforms. These
decompositions also yield algorithms for implementing FFTs locally with respect to
mesh-connected  arrays.  In  the  Appendix,  we  present  computer  programs  which  are 
related  to  these  decompositions.  In  particular,  two  of  the  programs  are  implementations 
of  the  100  x  100  DFT  using  the  local  algorithms  developed  in  Chapters  4  and  5.  These 
programs,  if  implemented  on  a  mesh-connected  array  having  one  processor  per  pixel, 
would  take,  at  most,  18  parallel  multiplication  steps,  52  parallel  addition  steps,  and  690 
parallel  permutation  steps  or  36  parallel  multiplication  steps,  36  parallel  addition  steps, 
and  654  parallel  permutation  steps.  By  a  parallel  permutation  step,  we  mean  switching 
the data in horizontally or vertically adjacent processors. These two programs are written
in an extended version of FORTRAN 77 which accepts image algebra operands and
operations  and  are  therefore  written  in  a  parallel  fashion  although  they  were  run  on  a 
serial  machine. 

The  results  in  Chapter  2  are  generalizations  of  theorems  proven  by  Tchuente  [61]. 
The results in Chapter 3 were influenced by various sources. Translation invariant templates
occur naturally in digital image processing. It is common practice to use circular
convolutions to implement translation invariant transformations since circular convolutions
can be computed using fast transform methods. Circulant templates are used to
implement these convolutions. The generalizations of circulants grew out of an awareness
of the possible uses of Cayley networks as models for parallel computer architectures
and the importance of translation invariant transformations in parallel processing;
these generalizations were also influenced by observations that the discrete Fourier
transform is related to the theory of group representations [3].
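The link between circulants and transform methods can be checked numerically in one dimension. The sketch below is only an illustration: it applies a circulant matrix directly and compares the result with the inverse DFT of the pointwise product of two DFTs. A naive O(n^2) DFT stands in for a fast transform, the length 6 is chosen only to show that nothing depends on powers of two, and all names are hypothetical.

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform of a sequence (O(n^2) operations)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * r * k / n)
               for k in range(n))
           for r in range(n)]
    return [z / n for z in out] if inverse else out


def circulant_apply(weights, a):
    """Apply the circulant matrix C[i][k] = weights[(i - k) % n] to a."""
    n = len(a)
    return [sum(weights[(i - k) % n] * a[k] for k in range(n))
            for i in range(n)]


def circular_convolve_dft(weights, a):
    """Circular convolution as the inverse DFT of a pointwise product."""
    W, A = dft(weights), dft(a)
    return dft([w * x for w, x in zip(W, A)], inverse=True)


weights = [1, 2, 0, 0, 0, 3]
a = [4, 0, 1, 5, 2, 7]
direct = circulant_apply(weights, a)
via_dft = circular_convolve_dft(weights, a)
agree = all(abs(d - v) < 1e-9 for d, v in zip(direct, via_dft))
```

The two results agree up to rounding error, which is the sense in which a decomposition of the Fourier transform yields a way of computing a circular convolution.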

The  material  in  Chapter  4  has  an  extensive  background,  since  use  is  made  of  FFTs. 
Since we use matrix representations of FFTs, we relate this aspect of their history only.
Use of matrix algebra to describe FFTs seems to have originated with Good [16], who
used it to demonstrate the computational effectiveness of an FFT algorithm seven
years before the famous paper of Cooley and Tukey [9]. Pease derived tridiagonal
decompositions  of  power-of-two  Fourier  matrices  and  pointed  out  their  value  in  parallel 
processing  [40].  Others  have  used  matrix  algebra  to  prove  the  correctness  of  variations 
of  FFT  algorithms  [23,  63].  Good  also  used  matrix  algebra  to  compare  different  FFT 
algorithms  [17].  Rose  [48]  assimilated  the  various  matrix  representations  and  developed 
a  number  of  matrix  identities  representing  various  techniques  for  implementing  FFTs  on 
serial  processors.  In  a  related  paper,  Parlett  [39]  shows  that  the  Winograd  FFTs  [67]  can 
be represented as eigenvalue-eigenvector decompositions of circulant matrices. Radix-two
FFTs have been implemented on mesh-connected arrays by Jesshope [21] and Strong
[60]. These implementations are based on the decompositions developed by Pease. Due
to image sensor characteristics, however, images may be digitized in a variety of dimensions,
such as 120 x 360 [47]. Furthermore, architectural considerations sometimes result
in mesh-connected arrays being built with dimensions other than powers of two. Examples
of such arrays are those built using GAPP chips which have dimensions that are
multiples of six [8, 54]. Moreover, even if it were possible on a mesh-connected array,
experts in the field warn against enlarging the data set by appending enough zeroes to
force power-of-two dimensions [18]. Hence there is a need for methods of implementing
arbitrary  discrete  Fourier  transforms  locally,  which  we  provide. 

In  Chapter  5,  we  develop  alternative  methods  of  computing  local  decompositions  of 
Fourier  matrices.  The  methods  are  numerical  in  nature  and  are  (seemingly)  unrelated  to 
FFTs.  They  are  based  on  some  theoretical  work  done  by  Tchuente  concerning  factoring 
matrices into products of tridiagonal matrices [61]. Tchuente is concerned with the general
problem of calculating linear transforms in parallel and makes no mention of any
specific transforms nor does he address implementation issues in much detail. We apply
a  special  case  of  his  general  method  to  the  Fourier  matrices.  We  develop  necessary  and 
sufficient  conditions  for  the  method  to  work  and  use  them  to  show  that  it  can  be  applied 
to  the  Fourier  matrices. 

We  conclude  this  introduction  by  pointing  out  that  we  expect  this  dissertation  to 
be  read  and  used  by  members  of  a  multidisciplinary  research  team.  We  have  written  it 
with  this  in  mind  and  therefore  have  attempted  to  include  as  much  detail  as  possible  to 
keep  the  results  accessible  to  nonmathematicians.  We  hope  that  we  have  succeeded  and 
that  this  work  will  be  useful  to  others. 


PART I

TEMPLATE  DECOMPOSITIONS  AND  THE  LINEAR  SUBALGEBRA 
OF  AN  IMAGE  ALGEBRA 


In  Part  I  of  this  dissertation  we  consider  some  general  questions  concerning  the 
linear subalgebra of an image algebra and the applications of this subalgebra as a
mathematical  model  of  parallel  image  processing. 

Part  I  is  divided  into  three  chapters.  In  the  first  chapter,  we  define  the  operators 
and  operands  of  image  algebra  and  we  describe  the  relationship  between  image  algebra 
and  linear  algebra.  In  the  second  chapter,  we  show  how  the  image  algebra  can  be  used 
as  a  mathematical  model  of  parallel  computer  architectures.  We  pose  the  question  of 
whether  all  linear  transforms  can  be  computed  on  a  given  architecture  using  the  notion 
of  local  template  decompositions.  We  develop  necessary  and  sufficient  conditions  for  the 
existence  of  local  decompositions.  In  the  third  chapter,  we  examine  two  important 
classes  of  templates,  translation  invariant  and  circulant  templates.  These  templates  are 
used  to  implement  convolutions  in  the  image  algebra  and  occur  frequently  in  digital 
image  processing.  They  are  closely  related  to  the  discrete  Fourier  transform.  In  fact,  any 
local  decomposition  of  the  Fourier  transform  yields  a  local  decomposition  of  a  circulant 
template.  Local  decompositions  of  the  discrete  Fourier  transform  are  the  subject  of  Part 
II  of  this  dissertation.  We  generalize  the  notion  of  a  circulant  template  by  defining  a 
larger  class  of  templates  which  we  call  G-templates.  We  discuss  possible  applications  of 
G-templates  to  parallel  image  processing. 

CHAPTER 1
IMAGE  ALGEBRA 


In  this  chapter,  we  present  the  basic  definitions  and  properties  of  an  image  algebra. 
We also describe the relationships between image algebra and linear algebra. The notation
and relationships introduced here will be used throughout this dissertation.

1.1.  Basic  Definitions  and  Notation 

An image algebra consists of a number of sets with various operations defined on
and between the sets. The sets consist of a finite coordinate set, X, and three sets of
operands. The sets of operands are a field F (either the real or complex numbers), the
set of graphs of all functions $a : X \to F$, which we call images, denoted $F^X$, and a set of
objects called templates which are used to transform images. We consider F to be
endowed with the usual field operations as well as the lattice operations of maximum
and minimum. Operations on $F^X$ are defined pointwise using operations on F. Operations
are defined between templates and images. A template is defined relative to two
coordinate sets, X and Y, and induces three different mappings from $F^X$ to $F^Y$, one for
each operation between templates and images. Operations are also defined between templates,
some pointwise and some as generalizations of convolutions. The definitions of
these operations are designed to reflect the types of computations and computing
environments which are currently encountered, or likely to be encountered in the near
future, in digital image processing. We now give the formal definitions.

1.1.1. Operands

Throughout this dissertation C, R, and Z shall denote the sets of complex
numbers, real numbers, and integers respectively, and, unless specifically stated otherwise,
X denotes a finite subset of $Z^k$, where $Z^k$ denotes the k-fold Cartesian product of Z.

We impose a linear order on C by defining $a + bi < c + di$ if and only if $a < c$, or
$a = c$ and $b < d$. With this ordering the maximum and minimum of two complex
numbers can be defined. We denote the binary operations of maximum and
minimum on the set of real or complex numbers by $\vee$ and $\wedge$, respectively. Henceforth,
we shall use F to denote either the field of complex numbers or the real numbers, the
complex numbers being ordered as above and the real numbers being ordered in the
usual way.
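The ordering just described compares real parts first and imaginary parts second. A minimal sketch of the resulting maximum and minimum operations follows. Python does not order complex numbers itself, so the comparison is done on (real, imaginary) pairs; the names `lex_key`, `vmax`, and `vmin` are illustrative only.

```python
def lex_key(z):
    """Sort key for the ordering: compare real parts, then imaginary parts."""
    z = complex(z)
    return (z.real, z.imag)


def vmax(u, v):
    """Maximum of two elements of F under the ordering above."""
    return u if lex_key(u) >= lex_key(v) else v


def vmin(u, v):
    """Minimum of two elements of F under the ordering above."""
    return u if lex_key(u) <= lex_key(v) else v


same_real = vmax(1 + 2j, 1 + 3j)   # real parts tie, imaginary part decides
real_first = vmax(2 - 5j, 1 + 9j)  # the larger real part wins outright
on_reals = vmin(3, 2.5)            # reduces to the usual order on the reals
```

On real arguments the two functions reduce to the ordinary maximum and minimum, so the same code serves for either choice of F.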

Definition 1.1. An image, A, on X with values in F is the graph of a function $a : X \to F$;
that is, $A = \{\, (x, a(x)) : x \in X \,\}$. The set of all images on X with values in F is
denoted by $F^X$. The set X is called the set of image coordinates, or coordinate set.
When F is understood we say that A is an image on X or that A is an image.

Definition 1.2. Let X, Y be finite subsets of $Z^k$ and $Z^n$, respectively. A template from Y
to X is a pair $(\mathcal{T}, \mathbf{T})$ where

1.) $\mathcal{T} : Y \to 2^X$, with $2^X$ denoting the power set of X, and

2.) $\mathbf{T} : Y \to F^X$ such that

\[ \mathbf{T}(y) = \{\, (x, t_y(x)) : t_y(x) = 0 \text{ if } x \notin \mathcal{T}(y),\ x \in X \,\}. \]

That is, $\mathbf{T}(y)$ is the graph of a function $t_y : X \to F$ whose support lies in $\mathcal{T}(y)$. The
point y is called the center of the set $\mathcal{T}(y)$, and the values $t_y(x)$ for $x \in \mathcal{T}(y)$ are called
the weights of $\mathbf{T}(y)$.

If $(\mathcal{T}, \mathbf{T})$ is a template, then $\mathbf{T}$ is called a template function with configuration $\mathcal{T}$, and $\mathcal{T}$
is called a template configuration, neighborhood configuration, or just configuration, for
Y on X. If Y = X then $(\mathcal{T}, \mathbf{T})$ is called a template on X and $\mathcal{T}$ a template or neighborhood
configuration on X. Whenever it is not necessary to specify $\mathcal{T}$ explicitly, we simply
say that $\mathbf{T}$ is a template (to distinguish it from "$\mathbf{T}$ is a function from Y to $F^X$"). The set
of all templates from Y to X will be denoted by $T_{Y|X}$, and if Y = X, then we define
$T_X = T_{Y|X}$.

We usually use the conventions that $x, y, z \in X$, that A, B, C are images which are the
graphs of the functions $a, b, c : X \to F$ respectively, and that R, S, T are templates with
corresponding gray level functions $r_x$, $s_x$, and $t_x$.

1.1.2. Image Operations

Let $A, B \in F^X$. The binary operations of addition, multiplication, maximum, and
exponentiation are defined as follows:

\[ A + B = \{\, (x, c(x)) : c(x) = a(x) + b(x),\ x \in X \,\} \]

\[ A * B = \{\, (x, c(x)) : c(x) = a(x) * b(x),\ x \in X \,\} \]

\[ A \vee B = \{\, (x, c(x)) : c(x) = a(x) \vee b(x),\ x \in X \,\} \]

\[ A^B = \{\, (x, c(x)) : c(x) = a(x)^{b(x)} \text{ if } a(x) \neq 0, \text{ else } c(x) = 0,\ x \in X \,\}. \]

If F = R, we restrict the latter binary operation to those pairs of images A, B for which
$a(x)^{b(x)} \in R$ whenever $a(x) \neq 0$.

The dot product of two images is defined by

\[ A \cdot B = \sum_{x \in X} a(x)\, b(x). \]

An image A is called a constant image if all its gray values are the same; that is, if
$a(x) = k$ for some real number k and for all $x \in X$. The constant images O and I
defined by $O = \{ (x, 0) : x \in X \}$ and $I = \{ (x, 1) : x \in X \}$ are the additive and multiplicative
identities, respectively, of $F^X$.

Suppose $k \in F$ and A is a constant image with $a(x) = k$. Then we define $B^k = B^A$,
$kB = A * B$, and $k + B = A + B$. We note that exponentiation is defined even when
$a(x) = 0$. Also, subtraction, division and minimum can be defined in terms of the basic
operations and inverses. Specifically: $A - B = A + (-B)$, $A/B = A * B^{-1}$, and
$A \wedge B = -(-A \vee -B)$.
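The pointwise image operations, and the way the derived operations are built from the basic ones, can be sketched as follows. The sketch assumes real-valued images stored as dictionaries from coordinates to values on a small fixed coordinate set; this representation and all function names are illustrative choices, not part of the image algebra itself.

```python
X = [(i, j) for i in range(2) for j in range(2)]  # a small coordinate set


def constant(k):
    """The constant image with gray value k."""
    return {x: k for x in X}


def add(A, B):
    return {x: A[x] + B[x] for x in X}


def mul(A, B):
    return {x: A[x] * B[x] for x in X}


def vee(A, B):
    return {x: max(A[x], B[x]) for x in X}


def power(A, B):
    """Pointwise a(x) ** b(x), with the value 0 wherever a(x) = 0."""
    return {x: (A[x] ** B[x] if A[x] != 0 else 0) for x in X}


def neg(A):
    return {x: -A[x] for x in X}


def sub(A, B):
    """A - B = A + (-B)."""
    return add(A, neg(B))


def div(A, B):
    """A / B = A * B^(-1), using the exponentiation convention above."""
    return mul(A, power(B, constant(-1)))


def wedge(A, B):
    """Minimum derived from maximum: -(-A v -B)."""
    return neg(vee(neg(A), neg(B)))


def dot(A, B):
    return sum(A[x] * B[x] for x in X)


A = {(0, 0): 1.0, (0, 1): -2.0, (1, 0): 0.0, (1, 1): 4.0}
B = {(0, 0): 2.0, (0, 1): 2.0, (1, 0): 5.0, (1, 1): 0.5}
minimum = wedge(A, B)
quotient = div(A, B)
```

Because exponentiation returns 0 at points where the base is 0, `power(constant(0), constant(-1))` is the zero image rather than an error, which is what makes $B^{-1}$ usable in the definition of division.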

Characteristic functions on images can be defined in terms of the previous binary
operations. For example, if A and B are images on X, then the characteristic function of
A greater than B is given by

\[ \chi_{>B}(A) = [(A - B) \vee O]^{-1} * [(A - B) \vee O]. \]

Thus,

\[ \chi_{>B}(A) = \{\, (x, c(x)) : c(x) = 1 \text{ if } a(x) > b(x), \text{ else } c(x) = 0 \,\}. \]

Whenever B is the constant image with gray values equal to k, it is customary to replace
B by k in the above definition. The remaining characteristic functions of images can be
defined in a similar fashion, using complementation and products.
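The formula for the characteristic function relies on the convention that a zero base raised to any power is zero. The following sketch evaluates $[(A - B) \vee O]^{-1} * [(A - B) \vee O]$ point by point. It assumes images stored as dictionaries and uses exact rational arithmetic (`fractions.Fraction`) so that the product of a value and its inverse is exactly 1; both choices are assumptions made for this illustration.

```python
from fractions import Fraction


def chi_greater(A, B):
    """Characteristic function of a(x) > b(x), built from the basic operations."""
    result = {}
    for x in A:
        d = max(A[x] - B[x], 0)         # (A - B) v O at the point x
        inv = d ** -1 if d != 0 else 0  # exponentiation convention: 0 stays 0
        result[x] = inv * d             # 1 where a(x) > b(x), otherwise 0
    return result


A = {0: Fraction(5), 1: Fraction(2), 2: Fraction(7, 2)}
B = {0: Fraction(3), 1: Fraction(2), 2: Fraction(4)}
chi = chi_greater(A, B)
```

Only the first point satisfies a(x) > b(x), so the result is 1 there and 0 at the other two points, including the point where the two images are equal.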

The basic unary operations on $F^X$ are the functions available in most high level
programming languages. In fact, any function $f : F \to F$ induces a function $f : F^X \to F^X$
by $f(A) = \{\, (x, c(x)) : c(x) = f(a(x)) \,\}$.

1.1.3.  Image-Template  Operations 

There are three operations between images and templates which are used to
transform an image. They are denoted $\oplus$, $\ovee$, and $\boxed{\vee}$. These neighborhood operations
transform each image point by performing the basic operation of addition or maximum
on a weighted collection of neighboring image values. In particular, if $A \in F^X$ and
$\mathbf{T} \in T_{Y|X}$, then

\[ A \oplus \mathbf{T} = \{\, (y, c(y)) : c(y) = \sum_{x \in \mathcal{T}(y)} a(x) \cdot t_y(x),\ y \in Y \,\} \]

\[ A \ovee \mathbf{T} = \{\, (y, c(y)) : c(y) = \bigvee_{x \in \mathcal{T}(y)} a(x) \cdot t_y(x),\ y \in Y \,\} \]

\[ A \,\boxed{\vee}\, \mathbf{T} = \{\, (y, c(y)) : c(y) = \bigvee_{x \in \mathcal{T}(y)} a(x) + t_y(x),\ y \in Y \,\}. \]

Note that $A \in F^X$, while $A \oplus \mathbf{T} \in F^Y$. Thus, template operations on images provide a
tool for image rotation, zooming, image reduction, masked extraction, and so on. In this
dissertation, however, we restrict ourselves to templates on X.
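A small sketch may help fix the three image-template operations. It assumes a template on X is represented by a function t with t(y) returning a dictionary {x: t_y(x)} whose keys form the configuration of y, and that every configuration is non-empty so that the maxima are defined. The helper `make_template`, which builds a translation invariant template from a table of offsets and drops points falling outside the rectangle, is hypothetical and introduced only for this example.

```python
def make_template(m, n, weights):
    """Return t with t(y) = {x: t_y(x)} for x in the configuration of y.

    `weights` maps an offset (di, dj) to a weight; points outside the
    m x n rectangle are dropped from the configuration.
    """
    def t(y):
        i, j = y
        return {(i + di, j + dj): w for (di, dj), w in weights.items()
                if 0 <= i + di < m and 0 <= j + dj < n}
    return t


def linear_conv(A, t, Y):
    """Sum of a(x) * t_y(x) over the configuration of y."""
    return {y: sum(A[x] * w for x, w in t(y).items()) for y in Y}


def mult_max(A, t, Y):
    """Maximum of a(x) * t_y(x) over the configuration of y."""
    return {y: max(A[x] * w for x, w in t(y).items()) for y in Y}


def add_max(A, t, Y):
    """Maximum of a(x) + t_y(x) over the configuration of y."""
    return {y: max(A[x] + w for x, w in t(y).items()) for y in Y}


m = n = 3
X = [(i, j) for i in range(m) for j in range(n)]
A = {(i, j): 3 * i + j for (i, j) in X}  # gray values 0..8, row by row

cross = {(0, 0): 1, (-1, 0): 1, (1, 0): 1, (0, -1): 1, (0, 1): 1}
flat = {offset: 0 for offset in cross}

neighbor_sum = linear_conv(A, make_template(m, n, cross), X)
local_max = add_max(A, make_template(m, n, flat), X)
scaled_max = mult_max(A, make_template(m, n, {(0, 0): 2, (0, 1): 1}), X)
```

With unit weights on a von Neumann configuration the first operation sums each neighborhood, and with zero weights the additive maximum returns the largest gray value in each neighborhood.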

1.1.4.  Template  Operations 

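We first define the pointwise operations. The basic operations of addition, multiplication
and maximum on the set of templates are induced pointwise by the corresponding
operations of $F^X$. In particular, if $(\mathcal{T}, \mathbf{T})$ and $(\mathcal{S}, \mathbf{S})$ are elements of $T_X$, then the sum,
product, and maximum of $(\mathcal{T}, \mathbf{T})$ and $(\mathcal{S}, \mathbf{S})$ are given by

\[ (\mathcal{T}, \mathbf{T}) + (\mathcal{S}, \mathbf{S}) = (\mathcal{R}, \mathbf{R}) \]

where

\[ \mathbf{R}(x) = \mathbf{T}(x) + \mathbf{S}(x) \quad\text{and}\quad \mathcal{R}(x) = \mathcal{T}(x) \cup \mathcal{S}(x), \]

\[ (\mathcal{T}, \mathbf{T}) * (\mathcal{S}, \mathbf{S}) = (\mathcal{R}, \mathbf{R}) \]

where

\[ \mathbf{R}(x) = \mathbf{T}(x) * \mathbf{S}(x) \quad\text{and}\quad \mathcal{R}(x) = \mathcal{T}(x) \cup \mathcal{S}(x), \]

and

\[ (\mathcal{T}, \mathbf{T}) \vee (\mathcal{S}, \mathbf{S}) = (\mathcal{R}, \mathbf{R}) \]

where

\[ \mathbf{R}(x) = \mathbf{T}(x) \vee \mathbf{S}(x) \quad\text{and}\quad \mathcal{R}(x) = \mathcal{T}(x) \cup \mathcal{S}(x). \]

Again, subtraction, division, minimum, and scalar multiplication can be derived
from these basic operations in a straightforward manner. Specifically, $\mathbf{T} - \mathbf{S}$ is defined
by $(\mathbf{T} - \mathbf{S})(x) = \mathbf{T}(x) - \mathbf{S}(x)$, $\mathbf{T}/\mathbf{S}$ by $(\mathbf{T}/\mathbf{S})(x) = \mathbf{T}(x)/\mathbf{S}(x)$, $\mathbf{T} \wedge \mathbf{S}$ by
$(\mathbf{T} \wedge \mathbf{S})(x) = \mathbf{T}(x) \wedge \mathbf{S}(x)$, and $r\mathbf{T}$ by $(r\mathbf{T})(x) = r\mathbf{T}(x)$. The configurations resulting from
these operations are the same as the configurations resulting from the operations from
which they were derived.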

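We now define the generalized convolutions between templates. Let
$(\mathcal{S}, S), (\mathcal{T}, T) \in T_X$. Then

$(\mathcal{S}, S) \oplus (\mathcal{T}, T) = (\mathcal{R}, R)$ is defined by

\[ R(x) \equiv (S \oplus T)(x) \equiv \{\, (z, r_x(z)) : r_x(z) = \sum_{y \in \mathcal{T}(x) :\, z \in \mathcal{S}(y)} t_x(y)\, s_y(z),\ z \in X \,\}, \]

\[ \mathcal{R}(x) = \bigcup_{y \in \mathcal{T}(x)} \mathcal{S}(y), \quad\text{and}\quad r_x(z) = 0 \text{ if } z \notin \mathcal{R}(x); \]

$(\mathcal{S}, S) \mathbin{\boxed{\vee}} (\mathcal{T}, T) = (\mathcal{R}, R)$ is defined by

\[ R(x) \equiv (S \mathbin{\boxed{\vee}} T)(x) \equiv \{\, (z, r_x(z)) : r_x(z) = \bigvee_{y \in \mathcal{T}(x) :\, z \in \mathcal{S}(y)} \bigl( t_x(y) + s_y(z) \bigr),\ z \in X \,\}, \]

\[ \mathcal{R}(x) = \bigcup_{y \in \mathcal{T}(x)} \mathcal{S}(y), \quad\text{and}\quad r_x(z) = 0 \text{ if } z \notin \mathcal{R}(x); \]

$(\mathcal{S}, S) \mathbin{\textcircled{$\vee$}} (\mathcal{T}, T) = (\mathcal{R}, R)$ is defined by

\[ R(x) \equiv (S \mathbin{\textcircled{$\vee$}} T)(x) \equiv \{\, (z, r_x(z)) : r_x(z) = \bigvee_{y \in \mathcal{T}(x) :\, z \in \mathcal{S}(y)} t_x(y)\, s_y(z),\ z \in X \,\}, \]

\[ \mathcal{R}(x) = \bigcup_{y \in \mathcal{T}(x)} \mathcal{S}(y), \quad\text{and}\quad r_x(z) = 0 \text{ if } z \notin \mathcal{R}(x). \]

The notation $y \in \mathcal{T}(x) : z \in \mathcal{S}(y)$ means that $y \in \mathcal{T}(x)$ and $z \in \mathcal{S}(y)$. Note that if $z \notin \mathcal{S}(y)$,
then $s_y(z) = 0$. Hence, in the case of $\oplus$ we may write

\[ r_x(z) = \sum_{y \in \mathcal{T}(x)} t_x(y)\, s_y(z). \]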
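
As a hedged illustration of the product $S \oplus T$ for the linear case, the sketch below composes two small templates and checks that applying $S$ and then $T$ to an image agrees with applying $S \oplus T$ once. The names `apply` and `compose`, the dictionary representation, and the sample templates are assumptions introduced here.

```python
# A template is a dict {x: {y: t_x(y)}}; weights that are not listed count as 0.

def apply(a, t, points):
    """(A (+) T)(x) = sum over y of a(y) * t_x(y)."""
    return {x: sum(a[y] * w for y, w in t.get(x, {}).items()) for x in points}

def compose(s, t, points):
    """R = S (+) T with r_x(z) = sum over y in config_T(x) of t_x(y) * s_y(z)."""
    r = {}
    for x in points:
        rx = {}
        for y, txy in t.get(x, {}).items():
            for z, syz in s.get(y, {}).items():
                rx[z] = rx.get(z, 0) + txy * syz
        r[x] = rx  # keys of rx lie in the union of config_S(y) over y in config_T(x)
    return r

points = [0, 1, 2]
a = {0: 3, 1: 5, 2: 7}
s = {0: {2: 1}, 1: {0: 1}, 2: {1: 1}}              # hypothetical cyclic shift
t = {0: {0: 1, 1: 1}, 1: {1: 2}, 2: {2: 1, 0: -1}}  # hypothetical local template

two_steps = apply(apply(a, s, points), t, points)
one_step = apply(a, compose(s, t, points), points)
```

Both routes give the same image, which is the property that later allows a global transform to be carried out as a sequence of local steps.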

This  concludes  the  brief  description  of  the  image  algebra.  It  has  been  shown  that 
this  set  of  operands  and  operators  is  sufficient  to  express  all  "interesting"  image  to  image 
transformations  [46].  In  the  rest  of  this  dissertation,  we  will  focus  on  the  linear  subalge- 
bra  of  the  image  algebra.  In  the  next  section  we  describe  this  subalgebra. 

1.2.  Image  Algebra  and  Linear  Algebra 

In  this  section,  we  describe  relationships  between  image  algebra  and  linear  algebra. 
These  relationships  will  be  used  throughout  this  dissertation. 

Definition 1.6. Define a relation, $\sim$, on $T_X$ by declaring $S \sim T$ if and only if for every
$x, y \in X$, $t_x(y) = s_x(y)$.

The relation $\sim$ is clearly an equivalence relation. Denote by $L_X$ the set of equivalence
classes of $T_X$ under $\sim$ and by $[T]$ a typical element of $L_X$. The proof of the next
theorem is routine and therefore we omit it.

Theorem 1.7. If $S_1 \sim S_2$ and $T_1 \sim T_2$, then

1.) $[T_1 + S_1] = [T_2 + S_2]$.
2.) $[T_1 \oplus S_1] = [T_2 \oplus S_2]$.
3.) $A \oplus T_1 = A \oplus T_2$.
Hence, we can use the operations of $T_X$ to define operations on $L_X$.

Definition 1.8. If $[T], [S] \in L_X$, $T \in [T]$, and $S \in [S]$, then define

1.) $[T] + [S] = [T + S]$

2.) $[T] \oplus [S] = [T \oplus S]$

3.) $A \oplus [T] = A \oplus T$.

If $[T] \in L_X$, then there exists a unique template $S \in [T]$ with the property that for
every $R \in [T]$, $\mathcal{S}(x) \subseteq \mathcal{R}(x)$ for every $x \in X$. $S$ has the property that $s_x(y) = 0$ if and
only if $y \notin \mathcal{S}(x)$. We say that such a template has minimal configuration.

Henceforth, we shall identify an equivalence class of templates with the representative
of that equivalence class having minimal configuration. By Theorem 1.7, the algebraic
operations remain valid. That is, if $S$ and $T$ are templates with minimal
configuration, then $S + T$ and $S \oplus T$ are also templates with minimal configuration
when the operations are considered to be between equivalence classes. Note that the
operations $+$ and $\oplus$ on $L_X$ are not the same as the operations $+$ and $\oplus$ on $T_X$. For
example, if $T \in T_X$ with nonempty minimal configuration, then $T - T$ does not have
minimal configuration (in $T_X$). Computationally, however, $T_X$ and $L_X$ are essentially the
same with respect to $\oplus$. From now on, we shall refer to the elements of $L_X$ as templates.
Note that when defining a template $T \in L_X$, there is no need to specify the
configuration once the template function has been specified.
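
The passage to the minimal-configuration representative can be sketched as follows. The helper names `minimal` and `subtract` and the sample template are hypothetical; the sketch only mirrors the $T - T$ example from the text.

```python
# A template is a dict {x: {y: t_x(y)}}; the inner keys are its configuration.

def subtract(s, t):
    """Pointwise S - T in T_X: the configuration is the union of both configurations."""
    out = {}
    for x in set(s) | set(t):
        keys = set(s.get(x, {})) | set(t.get(x, {}))
        out[x] = {y: s.get(x, {}).get(y, 0) - t.get(x, {}).get(y, 0) for y in keys}
    return out

def minimal(t):
    """Representative with minimal configuration: y is kept iff t_x(y) != 0."""
    return {x: {y: w for y, w in tx.items() if w != 0} for x, tx in t.items()}

t = {0: {0: 1, 1: 2}, 1: {1: 3}}   # hypothetical template with minimal configuration
diff = subtract(t, t)              # zero weights, but a nonempty configuration
zero = minimal(diff)               # the zero template, with empty configuration
```

In $T_X$ the difference keeps the configuration of $T$ even though every weight is zero; discarding zero weights yields the representative used in $L_X$.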

Denote by $O$ and $E$ the templates defined by

\[ O(x) = \{\, (y, t_x(y)) : t_x(y) = 0 \text{ for every } x \in X \,\} \]

and

\[ E(x) = \{\, (y, e_x(y)) : e_x(x) = 1 \text{ and } e_x(y) = 0 \text{ if } y \neq x \,\}. \]

For every $T \in L_X$, $O + T = T + O = T$. Note that if $S, T \in L_X$, then $S - T = O$ if and
only if $S = T$. This is not true in $T_X$. The template $E$ is an identity element for $\oplus$;
that is, $T \oplus E = E \oplus T = T$ and $A \oplus E = A$ for every $T \in L_X$ and $A \in F^X$.

Since $X$ is a finite set, it can be linearly ordered. Thus, we can write $X =
\{x_0, x_1, \ldots, x_{n-1}\}$. We define a mapping $\nu : F^X \to F^n$ by $\nu(A) = (a(x_0), a(x_1), \ldots, a(x_{n-1}))^t$.
Since $\nu(rA + sB) = r\nu(A) + s\nu(B)$ and $\nu$ is 1-1 and onto, $\nu$ is a vector space isomorphism.

Let $(M_n, *, +)$ denote the ring of $n \times n$ matrices with entries from $F$ under matrix
multiplication and addition. For any $T \in L_X$, we define a matrix $M_T = (m_{ij})$ where
$m_{ij} = t_{x_i}(x_j)$. Note that the $i$th row of $M_T$ is $\nu(T(x_i))^t$. Define a mapping $\Psi : L_X \to M_n$
by $\Psi(T) = M_T$.

Theorem 1.9. $\Psi$ is a ring isomorphism of $(L_X, \oplus, +)$ onto $(M_n, *, +)$. That is, if $S, T \in L_X$,
then

1.) $\Psi(S + T) = \Psi(S) + \Psi(T)$ or $M_{S+T} = M_S + M_T$

2.) $\Psi(S \oplus T) = \Psi(T) * \Psi(S)$ or $M_{S \oplus T} = M_T M_S$

3.) $\Psi$ is 1-1 and onto

4.) $\Psi(E) = I_n$ where $I_n$ denotes the $n \times n$ identity matrix

5.) $(L_X, \oplus, +)$ is a ring.

Proof. 

1.) Let $S, T \in L_X$. Then

\[ \Psi(S + T) = \begin{pmatrix} \nu((S+T)(x_0))^t \\ \vdots \\ \nu((S+T)(x_{n-1}))^t \end{pmatrix}. \]

Since $\nu((S+T)(x_i)) = \nu(S(x_i) + T(x_i)) = \nu(S(x_i)) + \nu(T(x_i))$, we may conclude that $\Psi(S + T)
= \Psi(S) + \Psi(T)$.

2.) Let $R = S \oplus T$. Then, by definition of $\oplus$, $r_{x_i}(x_j) = \sum_{y \in \mathcal{T}(x_i)} t_{x_i}(y)\, s_y(x_j)$. By
definition of matrix multiplication $M_T M_S = (c_{ij})$ where $c_{ij} = \sum_{k=0}^{n-1} t_{x_i}(x_k)\, s_{x_k}(x_j)$. Since
$t_{x_i}(x_k) = 0$ if $x_k \notin \mathcal{T}(x_i)$, we must have $c_{ij} = r_{x_i}(x_j)$ which implies that $\Psi(S \oplus T) =
\Psi(T) * \Psi(S)$.

3.) The map $\Psi$ is clearly onto. Furthermore $\Psi$ is 1-1 since if $\Psi(T) = \Psi(S)$ then
$\Psi(T - S) = 0$. Recall that $O$ is the only template with all zero gray values and minimal
configuration. Hence $T - S = O$ so $T = S$.

4.) If $M_E = (m_{ij})$ then by definition $m_{ij} = 1$ if and only if $i = j$.

5.) The set $L_X$ is an abelian group under addition since $F$ is an abelian group
under addition and addition on $L_X$ is induced pointwise by $F^X$. Suppose that $R, S, T
\in L_X$. Then $R \oplus (S \oplus T) = \Psi^{-1}(\Psi(R \oplus (S \oplus T))) = \Psi^{-1}(\Psi(S \oplus T) * \Psi(R)) =
\Psi^{-1}(\Psi(T) * \Psi(S) * \Psi(R)) = (R \oplus S) \oplus T$. It is also true that $R \oplus (S + T) =
\Psi^{-1}(\Psi(R \oplus (S + T))) = \Psi^{-1}((\Psi(S) + \Psi(T)) * \Psi(R)) = R \oplus S + R \oplus T$. Similarly $(S + T) \oplus R =
S \oplus R + T \oplus R$.

Q.E.D. 
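
Part 2.) of Theorem 1.9, including the reversal of the factors, can be checked numerically on a small example. This is a sketch under assumed representations (templates as nested dictionaries, matrices as lists of rows); the helper names are not from the source.

```python
def psi(t, order):
    """Matrix M_T with m_ij = t_{x_i}(x_j) for the given ordering of X."""
    return [[t.get(xi, {}).get(xj, 0) for xj in order] for xi in order]

def matmul(p, q):
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def compose(s, t, order):
    """R = S (+) T with r_x(z) = sum over y of t_x(y) * s_y(z)."""
    r = {}
    for x in order:
        rx = {}
        for y, txy in t.get(x, {}).items():
            for z, syz in s.get(y, {}).items():
                rx[z] = rx.get(z, 0) + txy * syz
        r[x] = rx
    return r

order = [0, 1, 2]
s = {0: {2: 1}, 1: {0: 1}, 2: {1: 1}}              # hypothetical templates
t = {0: {0: 1, 1: 1}, 1: {1: 2}, 2: {2: 1, 0: -1}}

lhs = psi(compose(s, t, order), order)             # Psi(S (+) T)
rhs = matmul(psi(t, order), psi(s, order))         # Psi(T) * Psi(S)
swapped = matmul(psi(s, order), psi(t, order))     # Psi(S) * Psi(T), for contrast
```

For this pair of templates the two matrix products differ, so the order of the factors in 2.) matters.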

Theorem 1.10. For every $A \in F^X$ and $T \in L_X$, $A \oplus T = \nu^{-1}(\Psi(T)\nu(A))$.

Proof.

Let $B = A \oplus T$. Then, by definition of $\oplus$, $b(x) = \sum_{y \in \mathcal{T}(x)} a(y)\, t_x(y)$. Furthermore,
$\Psi(T)\nu(A) = (c_0, c_1, \ldots, c_{n-1})^t$ where $c_i = \sum_{k=0}^{n-1} t_{x_i}(x_k)\, a(x_k)$. Since $t_{x_i}(x_k) = 0$ if $x_k \notin \mathcal{T}(x_i)$, we
have that $c_i = b(x_i)$ which shows that $\nu(B) = \Psi(T)\nu(A)$. The conclusion follows from
the fact that $\nu$ is invertible.

Q.E.D. 
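
A small numerical check of Theorem 1.10 is sketched below. The point labels, the helper names `nu`, `nu_inv`, `psi`, `matvec`, and `apply`, and the sample data are illustrative assumptions.

```python
def nu(a, order):
    """Vector (a(x_0), ..., a(x_{n-1})) for the given ordering of X."""
    return [a[x] for x in order]

def nu_inv(vec, order):
    """Inverse of nu: rebuild the image from its coordinate vector."""
    return dict(zip(order, vec))

def psi(t, order):
    """Matrix M_T with m_ij = t_{x_i}(x_j)."""
    return [[t.get(xi, {}).get(xj, 0) for xj in order] for xi in order]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def apply(a, t, order):
    """Direct evaluation of A (+) T: b(x) = sum over y of a(y) * t_x(y)."""
    return {x: sum(a[y] * w for y, w in t.get(x, {}).items()) for x in order}

a = {'p': 3, 'q': 5, 'r': 7}                                  # hypothetical image
t = {'p': {'p': 1, 'q': 1}, 'q': {'q': 2}, 'r': {'r': 1, 'p': -1}}

order = ['p', 'q', 'r']
via_matrix = nu_inv(matvec(psi(t, order), nu(a, order)), order)
direct = apply(a, t, order)

# A different ordering changes the matrix and the vector but not the image.
other = ['r', 'p', 'q']
via_other = nu_inv(matvec(psi(t, other), nu(a, other)), other)
```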

Note that there is a different isomorphism $\Psi$ for every linear ordering on $X$. In the next
theorem, we describe the relationship between them.

Denote by $S_n$ the group of permutations on $\{0, 1, \ldots, n-1\}$.

Definition 1.11. Let $\sigma \in S_n$ and define the $n \times n$ matrix $P_\sigma$ by

\[ P_\sigma = (p_{ij}) \quad\text{where}\quad p_{ij} = \begin{cases} 1 & \text{if } j = \sigma(i) \\ 0 & \text{otherwise.} \end{cases} \]

$P_\sigma$ is called a permutation matrix.

It is well known that permutation matrices are invertible and that $P_\sigma^{-1} = P_\sigma^t =
P_{\sigma^{-1}}$. In fact, the set of all $n \times n$ permutation matrices forms a group which is isomorphic
to $S_n$. Note that if $A = (a_{ij})$ is an $n \times n$ matrix, then multiplication on the left
by $P_\sigma$ permutes the rows of $A$ by $\sigma^{-1}$ and multiplication on the right by $P_\sigma^t = P_\sigma^{-1}$ permutes
the columns of $A$ by $\sigma^{-1}$. Hence, we can write $P_\sigma A P_\sigma^t = (a_{\sigma(i),\sigma(j)})$.

Theorem 1.12. The following two assertions hold for any $X$:

1.) Assume that $X_1 = \{x_0, x_1, \ldots, x_{n-1}\}$ and $X_2 = \{y_0, y_1, \ldots, y_{n-1}\}$ are
two different orderings of $X$. Let $\Psi_1 : L_X \to M_n$ and $\Psi_2 : L_X \to M_n$ be defined relative to
$X_1$ and $X_2$ respectively. There exists a permutation matrix, $P$, such that for every $T
\in L_X$, $\Psi_1(T) = P \Psi_2(T) P^t$.

2.) Conversely, assume that $\Psi_1 : L_X \to M_n$ and that there exists an $n \times n$ permutation
matrix $P$ with the property that for every $T \in L_X$ there exists a matrix $M(T) \in M_n$
such that $\Psi_1(T) = P M(T) P^t$. Then there exists an ordering on $X$ such that if
$\Psi_2 : L_X \to M_n$ is defined relative to that ordering, then $\Psi_2(T) = M(T)$ for every $T \in L_X$.

Proof.

1.) Let $\sigma \in S_n$ be the permutation defined by $x_{\sigma(i)} = y_i$ for $i \in [0,n-1]$. Denote
$\Psi_1(T) = (\alpha_{ij})$ and $\Psi_2(T) = (\gamma_{ij})$. Then $P_\sigma \Psi_1(T) P_\sigma^t = (\alpha_{\sigma(i),\sigma(j)})$. Furthermore, $\gamma_{ij} =
t_{y_i}(y_j) = t_{x_{\sigma(i)}}(x_{\sigma(j)}) = \alpha_{\sigma(i),\sigma(j)}$, which implies that $\Psi_2(T) = P_\sigma \Psi_1(T) P_\sigma^t$. Taking
$P = P_\sigma^t$ gives $\Psi_1(T) = P \Psi_2(T) P^t$.

2.) Assume that $X = \{x_0, x_1, \ldots, x_{n-1}\}$ is the ordering on $X$ used to define $\Psi_1$.
Since $P$ is a permutation matrix, there exists a $\sigma \in S_n$ such that $P = P_\sigma$. Define a
different ordering $\{y_0, y_1, \ldots, y_{n-1}\}$ of $X$ by taking $y_{\sigma(i)} = x_i$ or $y_i = x_{\sigma^{-1}(i)}$. Let
$\Psi_2 : L_X \to M_n$ be defined relative to this ordering. By 1.), $P^{-1} \Psi_1(T) P = \Psi_2(T)$ for every $T
\in L_X$. By assumption, $P^{-1} \Psi_1(T) P = M(T)$ for every $T \in L_X$. Hence $\Psi_2(T) = M(T)$.

Q.E.D. 
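
The relation $\Psi_2(T) = P_\sigma \Psi_1(T) P_\sigma^t$ from the proof of part 1.) can be verified on a small case. The sketch assumes nested-dictionary templates and list-of-rows matrices; all names and data are illustrative.

```python
def psi(t, order):
    """Matrix of the template t relative to the given ordering of X."""
    return [[t.get(xi, {}).get(xj, 0) for xj in order] for xi in order]

def perm_matrix(sigma):
    """P_sigma with p_ij = 1 if j == sigma(i), else 0."""
    n = len(sigma)
    return [[1 if j == sigma[i] else 0 for j in range(n)] for i in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(p, q):
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

t = {'p': {'p': 1, 'q': 1}, 'q': {'q': 2}, 'r': {'r': 1, 'p': -1}}  # hypothetical
order1 = ['p', 'q', 'r']                     # x_0, x_1, x_2
order2 = ['r', 'p', 'q']                     # y_0, y_1, y_2
sigma = [order1.index(y) for y in order2]    # x_{sigma(i)} = y_i

P = perm_matrix(sigma)
conjugated = matmul(matmul(P, psi(t, order1)), transpose(P))
```

Changing the ordering of $X$ therefore amounts to conjugating the matrix of a template by a permutation matrix.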

Henceforth, we shall use the symbol $\Psi$ to denote a mapping $\Psi : L_X \to M_n$ as described in
this section. The ordering on $X$ will be understood as given and no mention of it will be
made in general.

The special case that $X$ is an $m \times n$ array is of particular interest. In this case, we
take $X = \{ (x,y) : 0 \le x \le m-1,\ 0 \le y \le n-1 \}$. We can order $X$ lexicographically in
row major form by mapping $(x,y) \to xn + y$. The mappings $\nu$ and $\Psi$ then have the following
form:

\[ \nu(A) = (a_0, a_1, \ldots, a_{mn-1})^t \quad\text{where}\quad a_i = a(x,y) \text{ if } i = xn + y \]

and

\[ \Psi(T) = \begin{pmatrix} \nu(T(0,0))^t \\ \vdots \\ \nu(T(m-1,n-1))^t \end{pmatrix}. \]

Henceforth, whenever $X$ is an $m \times n$ array, we shall assume that these particular mappings
have been used.
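
The row-major ordering can be sketched in a few lines. The function names and the sample image below are assumptions for illustration.

```python
def index(x, y, n):
    """Position of the point (x, y) of an m x n array in row-major order."""
    return x * n + y

def point(i, n):
    """Inverse of index: recover (x, y) from the position i."""
    return divmod(i, n)

def nu(a, m, n):
    """Vector (a_0, ..., a_{mn-1}) with a_i = a(x, y) for i = x*n + y."""
    return [a[(x, y)] for x in range(m) for y in range(n)]

# Hypothetical 2 x 3 image whose value encodes its own coordinates.
a = {(x, y): 10 * x + y for x in range(2) for y in range(3)}
vec = nu(a, 2, 3)
```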

It  is  clear  from  the  results  of  this  section  that  any  relationship  which  can  be 
expressed  using  the  usual  notation  of  finite-dimensional  linear  algebra  can  also  be 
expressed  using  the  notation  of  image  algebra.  The  image  algebra  notation  differs  from 
conventional  matrix-vector  notation  in  that  it  reflects  the  way  that  computations  are 
carried  out  in  digital  image  processing.  There  is  a  great  deal  of  knowledge  expressed  in 
terms  of  matrices  and  vectors.  The  relationships  described  in  this  section  provide  the 
link  to  this  knowledge  and  offer  direct  methods  for  implementing  linear  image 
transforms  on  parallel  computer  architectures. 


CHAPTER  2 
TEMPLATE  DECOMPOSITIONS 


As  mentioned  in  the  Introduction,  in  order  to  take  advantage  of  the  capabilities  of 
computers  with  parallel  or  distributed  architectures,  there  is  a  need  to  understand  the 
process  of  computing  global  linear  transforms  locally.  In  this  chapter,  we  establish 
necessary  and  sufficient  conditions  for  the  existence  of  local  decompositions  of  all  linear 
transformations with respect to a directed network of processors. One particular implication
of this theorem is that any linear transformation can be factored into a sequence of
linear  transformations,  each  of  which  is  local  with  respect  to  a  mesh-connected  array. 
The  theorem  is  much  more  general,  however,  implying  that  any  linear  transformation 
can  be  factored  into  a  sequence  of  linear  transformations,  each  of  which  is  local  with 
respect  to  the  interconnection  structure  of  a  particular  network  of  processors,  as  long  as 
any  processor  in  the  network  can  communicate  with  any  other  processor  in  the  network, 
that  is,  as  long  as  there  is  a  path  through  the  network  between  every  pair  of  processors. 
The theorem is an existence theorem and does not yield an efficient technique for computing
such decompositions, although a constructive proof is given. The value of such a
theorem  is  that  it  shows  that  such  decompositions  exist  and,  therefore,  that  it  is  not 
futile  to  attempt  to  develop  methods  for  computing  them. 

Template  configurations  are  used  here  to  model  processor  arrays.  We  set  up 
correspondences between template configurations, directed graphs, and networks of processors.
A computation is considered to be compatible with a network if the templates
required to implement the computation are local with respect to the configuration
modeling the network. We then consider the problem of factoring templates into products
(with respect to $\oplus$) of templates which are local with respect to a given
configuration. We show that every template has such a local decomposition if and only
if  the  directed  graph  corresponding  to  the  configuration  is  strongly  connected.  This  is 
the  major  result  of  this  chapter  and  is  a  generalization  of  a  theorem  proven  by  Tchuente 
which  we  will  state  using  the  notation  provided  by  image  algebra  [61]. 

2.1.  Definitions  and  Background 

Throughout this chapter, let $X \subset Z^k$ be a finite set, $R : X \to 2^X$ an arbitrary template
configuration with the property that $x \in R(x)$ for every $x \in X$, and assume that
images have values in $C$. We think of the elements of $X$ as being processors in a network.
The assumption that $x \in R(x)$ can be interpreted as meaning that every processor
in the network has direct access to its own memory. Similarly, the interpretation of
$y \in R(x)$ is that processor $x$ has direct access to the memory of processor $y$.

Definition 2.1. Let $T \in L_X$. A template decomposition of $T$ is a set $\{T_i\}_{i=1}^{n}$ of templates
$T_i \in L_X$ such that $T = \bigoplus_{i=1}^{n} T_i$. The $T_i$ are called the factors of the decomposition. We
also say that $T = \bigoplus_{i=1}^{n} T_i$ is a decomposition of $T$.

Definition 2.2. We say that a template $T \in L_X$ is local with respect to $R$ if and only if
$\mathcal{T}(x) \subseteq R(x)$ for every $x \in X$. If $R$ is understood, then we say that $T$ is a local template.

Definition 2.3. Let $T \in L_X$ and assume that $\{T_i\}_{i=1}^{n}$ is a template decomposition of $T$.
We say that $\{T_i\}_{i=1}^{n}$ is a local decomposition of $T$ with respect to $R$ if and only if each
$T_i$ is local with respect to $R$ for $i \in [1,n]$. If $T \in L_X$ and if there exists a decomposition
$\{T_i\}_{i=1}^{n}$ of $T$ which is local with respect to $R$, then we say that $T$ has a local decomposition
with respect to $R$. If $R$ is understood, then we say that $T$ has a local decomposition.

Definition 2.4. A digraph, or directed graph, is a pair $D = (V,E)$ where $V$ is a finite set
and $E \subseteq V \times V$. The elements of $V$ are called vertices. If $(x,y) \in E$, then $(x,y)$ is called
an arc from $x$ to $y$ or just an arc. If there is a need, we may write $V = V(D)$ to
emphasize that $V$ is the vertex set for the digraph $D$; we may also write $E = E(D)$.

Definition 2.5. A graph is a digraph $G = (V,E)$ with the additional property that if $(x,y)
\in E$, then $(y,x) \in E$. If $(x,y) \in E$, then we say that $xy$, or equivalently $yx$, is an edge of
$G$.

Definition 2.6. Let $D = (V,E)$ be a digraph and let $u,v \in V$. A u-v walk (in $D$), or a walk
from $u$ to $v$, is a finite sequence of vertices $u = w_0, w_1, \ldots, w_{n-1}, w_n = v$ with the property
that $(w_i, w_{i+1}) \in E$ for every $i \in [0,n-1]$. A closed u-v walk is a u-v walk with the
property that $u = v$.

Definition 2.7. Let $D = (V,E)$ be a digraph and let $u,v \in V$. A u-v path (in $D$), or a path
from $u$ to $v$, is a u-v walk with distinct vertices, except possibly $u$ and $v$. A closed u-v
path is a u-v path with the property that $u = v$.

Remark 2.8. We sometimes name a u-v walk $P$ by using the symbolism

\[ P : u = w_0, w_1, \ldots, w_{n-1}, w_n = v. \]

We say that $w_i \in P$ or that $w_i$ is on $P$.

Remark 2.9. If $G = (V,E)$ is a graph and $u,v \in V$, then if there exists a u-v path in $G$,
then there exists a v-u path in $G$.

Definition 2.10. Let $D = (V,E)$ be a digraph and let $u,v \in V$. We say that $v$ is reachable
(in $D$) from $u$ if there exists a path from $u$ to $v$ (in $D$). If $u$ is reachable from $v$ and $v$ is
reachable from $u$, then we say that the pair $(u,v)$, or $(v,u)$, is mutually reachable (in $D$).

Remark 2.11. If $G$ is a graph, then reachable and mutually reachable are equivalent
notions.

Definition 2.12. Let $D = (V,E)$ be a digraph. We say that $D$ is strongly connected if and
only if every pair of vertices $(u,v) \in V \times V$ is mutually reachable. If $D$ is a strongly connected
graph, then we say that $D$ is connected.

We  now  set  up  the  correspondence  between  template  configurations  and  directed 
graphs. 

Definition 2.13. For every $x \in X$, let $E_x = \{ (y,x) : y \in R(x) \}$. The digraph of $R$,
denoted $D(R)$, is the digraph $D(R) = (X,E)$ where $E = \bigcup_{x \in X} E_x$.

Definition 2.14. We say that $R$ is symmetric if and only if for every $x,y \in X$, $x \in R(y)$
implies that $y \in R(x)$.

Theorem 2.15. $D(R)$ is a graph if and only if $R$ is symmetric.

Proof.

Assume that $D(R) = (X,E)$ is a graph. Let $x,y \in X$ with $x \in R(y)$. By definition,
$(x,y) \in E$. Since $D(R)$ is a graph this implies that $(y,x) \in E$. Thus, $y \in R(x)$ so $R$ is
symmetric.

Conversely assume that $R$ is symmetric and that $(x,y) \in E$. Then $x \in R(y)$ and
therefore, since $R$ is symmetric, $y \in R(x)$. Thus, $(y,x) \in E$ which implies that $D(R)$ is a
graph.

Q.E.D. 

If  R  is  symmetric,  then  we  write  G(R)  for  the  graph  of  R. 
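
The correspondence between configurations and digraphs can be made concrete with a short sketch. The representation of $R$ as a dictionary of sets, the function names, and the two sample networks are assumptions introduced for illustration.

```python
def digraph(r):
    """Arc set of D(R): an arc (y, x) for every y in R(x)."""
    return {(y, x) for x, rx in r.items() for y in rx}

def is_symmetric(r):
    """R is symmetric iff x in R(y) implies y in R(x)."""
    return all(y in r[x] for y in r for x in r[y])

def reachable(arcs, u):
    """Set of vertices reachable from u by following arcs."""
    seen, stack = {u}, [u]
    while stack:
        w = stack.pop()
        for p, q in arcs:
            if p == w and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def is_strongly_connected(r):
    arcs = digraph(r)
    return all(reachable(arcs, u) == set(r) for u in r)

# Hypothetical networks: a directed ring and a one-way pipeline on three processors.
ring = {0: {0, 2}, 1: {1, 0}, 2: {2, 1}}
pipeline = {0: {0}, 1: {1, 0}, 2: {2, 1}}

ring_arcs = digraph(ring)
ring_is_graph = all((q, p) in ring_arcs for (p, q) in ring_arcs)
```

The ring is strongly connected without being symmetric, which is the situation not covered by the symmetric case; the pipeline is neither.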

We  now  state  Tchuente's  theorem  using  the  notation  of  image  algebra. 

Theorem (Tchuente). Let $X \subset Z^k$ be a finite set and let $R : X \to 2^X$ be a symmetric
template configuration with the property that $x \in R(x)$ for every $x \in X$. Every template
$T \in L_X$ has a local decomposition with respect to $R$ if and only if $G(R)$ is connected.

Tchuente's  theorem  is  not  as  general  as  one  might  like.  The  assumption  that  the 
configuration  R  is  symmetric  restricts  the  type  of  multiprocessors  being  modeled  to  those 
in  which  the  data  can  flow  in  both  directions  along  any  of  the  communication  links.  In 
many  important  cases,  such  as  pipeline  and  parallel-pipeline  computers,  and  systolic 
arrays,  this  assumption  is  not  valid  [26,  30,  42]. 

We  will  prove  the  following  generalization  of  Tchuente's  theorem: 

Theorem 2.16. Let $X \subset Z^k$ be a finite set and let $R : X \to 2^X$ be a template
configuration with the property that $x \in R(x)$ for every $x \in X$. Every template $T \in L_X$
has a local decomposition with respect to $R$ if and only if $D(R)$ is strongly connected.

Although  the  two  theorems  appear  to  be  almost  identical,  it  happens  that,  as  is 
often  true  in  graph  theory,  the  case  for  directed  graphs  is  significantly  different  than  for 
graphs [4]. There are two differences between graphs and digraphs which emerge in trying
to use a straightforward generalization of Tchuente's proof. One difference is that
any permutation of a graph can be accomplished by a sequence of adjacent
transpositions. This is not necessarily true for a digraph (consider the directed cycle).
Another  difference  is  that  on  a  connected  graph  there  is  always  at  least  one  vertex  which 
can  be  removed,  along  with  the  edges  incident  to  it,  such  that  the  resulting  graph  is  still 
connected.  This  is  not  necessarily  true  of  a  digraph  (consider  again  the  directed  cycle). 
Tchuente  relies  upon  both  of  these  facts  for  graphs. 

We  will  prove  Theorem  2.16  in  stages.  We  first  prove  that  strong  connectivity  is  a 
necessary  condition.  We  then  establish  a  sequence  of  theorems  which  result  in  showing 
that  strong  connectivity  is  sufficient  for  permutation  templates.  Finally,  we  show  how 
certain matrix decompositions can be used in conjunction with the results on permutations
to deduce that strong connectivity is sufficient in general.

2.2.  Necessary  and  Sufficient  Conditions  for  the  Existence  of  Local  Decompositions 

We  now  show  that  strong  connectivity  is  necessary. 

Theorem 2.17. If every $T \in L_X$ has a local decomposition with respect to $R$, then $D(R)$ is
strongly connected.

Proof. 

Assume by way of contradiction that $D(R)$ is not strongly connected but that every
$T \in L_X$ has a local decomposition with respect to $R$. Let $x,y \in X$ such that $x$ is not
reachable from $y$. Let

\[ C_1 = \{\, u \in X : x \text{ is reachable from } u \,\} \]

and

\[ C_2 = \{\, v \in X : v \text{ is reachable from } y \,\}. \]

Note that $C_1 \cap C_2 = \emptyset$ since if $z \in C_1 \cap C_2$ then $z$ is reachable from $y$ and $x$ is reachable
from $z$ which implies that $x$ is reachable from $y$.

Assume that $T$ is a local template with respect to $R$. We show that $t_u(v) = 0$ if
$u \in C_1$ and $v \in C_2$. If this is not true, then there must be a $u \in C_1$ and $v \in C_2$ such
that $t_u(v) \neq 0$ which implies that $v \in \mathcal{T}(u)$. Since $T$ is local with respect to $R$,
$\mathcal{T}(u) \subseteq R(u)$ so $v \in R(u)$ which implies that $u$ is reachable from $v$. However, $x$ is
reachable from $u$ so $x$ is reachable from $v$. Thus, we arrive at the contradiction $v \in C_1$.

We now construct a template which cannot have a local decomposition with respect
to $R$. Let $T \in L_X$ be defined by

\[ T(z) = \begin{cases} E(y) & \text{if } z = x \\ 0 & \text{otherwise.} \end{cases} \]

By assumption there exists a decomposition $T = \bigoplus_{i=1}^{k} T_i$ which is local with respect to $R$.
Let $A \in C^X$ be defined by

\[ A = \{\, (z, a(z)) : a(z) = \begin{cases} 1 & \text{if } z = y \\ 0 & \text{otherwise} \end{cases} \,\}. \]

The image $B = A \oplus T$ is given by

\[ B = \{\, (z, b(z)) : b(z) = \begin{cases} 1 & \text{if } z = x \\ 0 & \text{otherwise} \end{cases} \,\}. \]

Let $j \in [1,k]$. We show that if $B_j = A \oplus (\bigoplus_{i=1}^{j} T_i)$, then $b_j(u) = 0$ for every $u \in C_1$.

Assume that this is not true for $B_1$. Then there exists a $u \in C_1$ such that

\[ b_1(u) = \sum_{v \in \mathcal{T}_1(u)} t_u^{(1)}(v)\, a(v) \neq 0. \]

Therefore, there must exist a $v \in \mathcal{T}_1(u)$ such that $t_u^{(1)}(v) \neq 0$ and
$a(v) \neq 0$ which implies that $v = y$ which in turn implies that $v \in C_2$. However, $v \in C_2$
means that $t_u^{(1)}(v) = 0$ which is a contradiction. Hence, the claim must be true for $B_1$.

Assume that for some $j \in [2,k]$ the claim is true for $B_1, B_2, \ldots, B_{j-1}$ and that it is not
true for $B_j$. Then there exists a $u \in C_1$ such that

\[ b_j(u) = \sum_{v \in \mathcal{T}_j(u)} t_u^{(j)}(v)\, b_{j-1}(v) \neq 0. \]

Therefore, there must exist a $v \in \mathcal{T}_j(u)$ such that $t_u^{(j)}(v) \neq 0$ and $b_{j-1}(v) \neq 0$. By the
induction hypothesis, $v \notin C_1$. The statement that $v \in \mathcal{T}_j(u)$ implies that $u$ is reachable
from $v$ since $T_j$ is local. Moreover, since $u \in C_1$, $x$ is reachable from $v$ so $v \in C_1$ which
is a contradiction. Hence, the claim must be true. In particular, it must be true for $B
= B_k = A \oplus T$. Therefore, $b(x) = 0$ which is another contradiction and proves the
theorem.

Q.E.D. 

Definition 2.18. Let $T \in L_X$. We say that $T$ is a permutation template if and only if $M_T$
is a permutation matrix for some $\Psi : L_X \to M_n$.

Remark 2.19. If $\Psi_1(T)$ is a permutation matrix for some ordering on $X$, then $\Psi_2(T)$ is
also a permutation matrix for any other ordering on $X$.

Remark  2.20.  Recall  that  the  set  of  all  n  x  n  permutation  matrices  is  isomorphic  to  the 
symmetric  group  Sn.  Thus,  in  particular,  since  every  permutation  can  be  factored  into  a 
product  of  transpositions,  every  permutation  template  can  be  factored  into  a  product  of 
templates  corresponding  to  transpositions.  We  identify  these  templates  in  the  next 
definition. 

Definition 2.21. For every $x,y \in X$, denote by $T_{xy} \in L_X$ the template defined by

\[ T_{xy}(z) = \begin{cases} E(x) & \text{if } z = y \\ E(y) & \text{if } z = x \\ E(z) & \text{otherwise.} \end{cases} \]

$T_{xy}$ is called the exchange template associated with $(x,y)$, or simply an exchange template.
Note that $T_{xy}$ is the permutation template corresponding to the transposition
$\sigma \in S_X$ defined by $\sigma = (x,y)$. If $A \in C^X$ and $B = A \oplus T_{xy}$, then

\[ b(z) = \begin{cases} a(x) & \text{if } z = y \\ a(y) & \text{if } z = x \\ a(z) & \text{otherwise.} \end{cases} \]

Lemma 2.22. Let $x,y \in X$ with $x \neq y$ and let $T_{xy}$ be the exchange template associated
with $(x,y)$. Assume that there exists an x-y path

\[ P_1 : x = u_0, u_1, \ldots, u_{k-1}, u_k = y \]

and a y-x path

\[ P_2 : y = w_0, w_1, \ldots, w_{j-1}, w_j = x \]

such that the closed x-x walk

\[ P : x = u_0, u_1, \ldots, u_{k-1}, u_k, w_1, \ldots, w_{j-1}, w_j = x \]

is a closed x-x path in $D(R)$. The exchange template $T_{xy}$ has a local decomposition with
respect to $R$.


Proof. 


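The proof is by construction. Denote the closed path $P$ by

\[ P : x = v_0, v_1, \ldots, v_{n-1}, v_n = x. \]

Assume that $y = v_j$ where $j \in [1,n-1]$ and denote $a_i = a(v_i)$. Note that

\[ T_{xy} = T_{v_0 v_1} \oplus T_{v_1 v_2} \oplus T_{v_2 v_3} \oplus \cdots \oplus T_{v_{j-1} v_j} \oplus T_{v_{j-2} v_{j-1}} \oplus \cdots \oplus T_{v_1 v_2} \oplus T_{v_0 v_1}. \]

We show how $T_{v_0 v_1}$ can be expressed as a product of local templates. It is interesting to
note that the factors of the decomposition are not permutation templates. Figure 2.1 is
helpful in understanding the templates which we are about to describe.

Define local templates $T_{1,1}, T_{1,2}, \ldots, T_{1,n-1}$ by

\[ T_{1,k}(z) = \begin{cases} E(v_k) + E(v_{k-1}) & \text{if } z = v_k \\ E(z) & \text{otherwise} \end{cases} \]

for $k \in [1,n-1]$. The image $B_1 = A \oplus (\bigoplus_{k=1}^{n-1} T_{1,k})$ is given by

\[ b_1(z) = \begin{cases} \sum_{i=0}^{k} a_i & \text{if } z = v_k \text{ for } k \in [0,n-1] \\ a(z) & \text{otherwise.} \end{cases} \]

Define the local template $T_{1,n}$ by

\[ T_{1,n}(z) = \begin{cases} E(v_{n-1}) - E(v_0) & \text{if } z = v_0 \\ E(v_k) - E(v_{k-1}) & \text{if } z = v_k \text{ for } k \in [2,n-1] \\ E(z) & \text{otherwise.} \end{cases} \]

The image $B_2 = B_1 \oplus T_{1,n}$ is given by

\[ b_2(z) = \begin{cases} \sum_{i=1}^{n-1} a_i & \text{if } z = v_0 \\ a_0 + a_1 & \text{if } z = v_1 \\ a_k & \text{if } z = v_k \text{ for } k \neq 0,1 \\ a(z) & \text{otherwise.} \end{cases} \]

Define local templates $T_{1,n+1}, T_{1,n+2}, \ldots, T_{1,2n-3}$ by

\[ T_{1,n+k}(z) = \begin{cases} E(v_{k+2}) + E(v_{k+1}) & \text{if } z = v_{k+2} \\ E(z) & \text{otherwise} \end{cases} \]

for $k \in [1,n-3]$. The image $B_3 = B_2 \oplus (\bigoplus_{k=1}^{n-3} T_{1,n+k})$ is given by

\[ b_3(z) = \begin{cases} \sum_{i=2}^{k} a_i & \text{if } z = v_k \text{ for } k \in [2,n-1] \\ b_2(z) & \text{if } z = v_0, v_1 \\ a(z) & \text{otherwise.} \end{cases} \]

Define the local template $T_{1,2n-2}$ by

\[ T_{1,2n-2}(z) = \begin{cases} E(v_0) - E(v_{n-1}) & \text{if } z = v_0 \\ E(v_k) - E(v_{k-1}) & \text{if } z = v_k \text{ for } k \in [3,n-1] \\ E(z) & \text{otherwise.} \end{cases} \]

The image $B_4 = B_3 \oplus T_{1,2n-2}$ is given by

\[ b_4(z) = \begin{cases} a_1 & \text{if } z = v_0 \\ a_0 + a_1 & \text{if } z = v_1 \\ a(z) & \text{otherwise.} \end{cases} \]

Define the local template $T_{1,2n-1}$ by

\[ T_{1,2n-1}(z) = \begin{cases} E(v_1) - E(v_0) & \text{if } z = v_1 \\ E(z) & \text{otherwise.} \end{cases} \]

The image $B_5 = B_4 \oplus T_{1,2n-1}$ is given by

\[ b_5(z) = \begin{cases} a_1 & \text{if } z = v_0 \\ a_0 & \text{if } z = v_1 \\ a(z) & \text{otherwise.} \end{cases} \]

Hence, $B_5 = A \oplus T_{v_0 v_1}$ which enables us to conclude that $T_{v_0 v_1} = \bigoplus_{i=1}^{2n-1} T_{1,i}$ is a local
decomposition of $T_{v_0 v_1}$. Similarly, we can construct local decompositions $\{T_{k,h}\}_{h=1}^{2n-1}$ of
$T_{v_{k-1} v_k}$ for $k \in [2,j]$. We then have that

\[ T_{xy} = \Bigl[ \bigoplus_{k=1}^{j} \Bigl( \bigoplus_{h=1}^{2n-1} T_{k,h} \Bigr) \Bigr] \oplus \Bigl[ \bigoplus_{k=j-1}^{1} \Bigl( \bigoplus_{h=1}^{2n-1} T_{k,h} \Bigr) \Bigr] \]

is a local decomposition of $T_{xy}$.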


Q.E.D. 
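
The construction in the proof can be simulated directly. The sketch below builds the $2n-1$ factors of $T_{v_0 v_1}$ on a directed cycle $v_0 \to v_1 \to \cdots \to v_{n-1} \to v_0$, with vertex $v_k$ represented by the integer $k$, and applies them in order. The function names are hypothetical, each factor lists only its non-identity rows, and the sketch assumes $n \ge 3$ so that the factor indices do not collide.

```python
def apply(a, t):
    """Apply template t to image a; rows not listed in t act as the identity."""
    b = dict(a)
    for z, row in t.items():
        b[z] = sum(w * a[v] for v, w in row.items())
    return b

def swap_factors(n):
    """Factors T_{1,1}, ..., T_{1,2n-1} of the exchange template T_{v0 v1}."""
    assert n >= 3, "sketch assumes a closed path with at least three vertices"
    f = [{k: {k: 1, k - 1: 1}} for k in range(1, n)]           # T_{1,1..n-1}
    t = {0: {n - 1: 1, 0: -1}}                                 # T_{1,n}
    t.update({k: {k: 1, k - 1: -1} for k in range(2, n)})
    f.append(t)
    f += [{k + 2: {k + 2: 1, k + 1: 1}} for k in range(1, n - 2)]  # T_{1,n+k}
    t = {0: {0: 1, n - 1: -1}}                                 # T_{1,2n-2}
    t.update({k: {k: 1, k - 1: -1} for k in range(3, n)})
    f.append(t)
    f.append({1: {1: 1, 0: -1}})                               # T_{1,2n-1}
    return f

def is_local(t, n):
    """Each vertex may read only itself and its predecessor on the cycle."""
    return all(v in (z, (z - 1) % n) for z, row in t.items() for v in row)

def local_swap(values):
    n = len(values)
    a = dict(enumerate(values))
    for t in swap_factors(n):
        a = apply(a, t)
    return [a[k] for k in range(n)]

result = local_swap([1, 2, 3, 4, 5])
```

Only additions and subtractions along arcs of the cycle are used, and none of the individual factors is a permutation template, as noted in the proof.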


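Vertex     A          B_1                     B_2                     B_3                     B_4          B_5

v_0        a_0        a_0                     \sum_{i=1}^{n-1} a_i    \sum_{i=1}^{n-1} a_i    a_1          a_1
v_1        a_1        a_0 + a_1               a_0 + a_1               a_0 + a_1               a_0 + a_1    a_0
v_2        a_2        \sum_{i=0}^{2} a_i      a_2                     a_2                     a_2          a_2
v_3        a_3        \sum_{i=0}^{3} a_i      a_3                     a_2 + a_3               a_3          a_3
...
v_{n-2}    a_{n-2}    \sum_{i=0}^{n-2} a_i    a_{n-2}                 \sum_{i=2}^{n-2} a_i    a_{n-2}      a_{n-2}
v_{n-1}    a_{n-1}    \sum_{i=0}^{n-1} a_i    a_{n-1}                 \sum_{i=2}^{n-1} a_i    a_{n-1}      a_{n-1}

Figure 2.1. Stages in the local implementation of $T_{v_0 v_1}$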


Theorem 2.23. Let $x,y \in X$ and let $T_{xy}$ be the exchange template associated with $(x,y)$.
Assume that $D(R)$ is strongly connected. The exchange template $T_{xy}$ has a local decomposition
with respect to $R$.

Proof. 

Since $D(R)$ is strongly connected, every pair $(x,y)$ is mutually reachable in $D(R)$.
Let

\[ P_1 : x = u_0, u_1, \ldots, u_{n-1}, u_n = y \]

\[ P_2 : y = v_0, v_1, \ldots, v_{m-1}, v_m = x \]

be x-y and y-x paths respectively. $P_2$ and $P_1$ can have only finitely many vertices in
common other than $x$ and $y$. If there are none, then, by Lemma 2.22, we are done.
Otherwise, label the common vertices $u_{i_1}, u_{i_2}, \ldots, u_{i_k}$ where $i_1 < i_2 < \cdots < i_k$. For
convenience we take $i_0 = 0$ and $i_{k+1} = n$. Define $j_1, j_2, \ldots, j_k$ by $v_{j_s} = u_{i_s}$ for $s \in
[1,k]$. By our choice of the $u_{i_s}$, the closed walks

\[ W_0 : x = u_0, u_1, \ldots, u_{i_1} = v_{j_1}, v_{j_1+1}, \ldots, v_m = x \]

\[ W_1 : u_{i_1}, u_{i_1+1}, \ldots, u_{i_2} = v_{j_2}, v_{j_2+1}, \ldots, v_{j_1} \]

\[ \vdots \]

\[ W_k : u_{i_k}, u_{i_k+1}, \ldots, u_{n-1}, u_n = y = v_0, v_1, \ldots, v_{j_k} \]

are all closed paths. Furthermore,

\[ T_{x,y} = T_{x,u_{i_1}} \oplus T_{u_{i_1},u_{i_2}} \oplus \cdots \oplus T_{u_{i_{k-1}},u_{i_k}} \oplus T_{u_{i_k},y} \oplus T_{u_{i_{k-1}},u_{i_k}} \oplus \cdots \oplus T_{u_{i_1},u_{i_2}} \oplus T_{x,u_{i_1}}. \]

Since each of the $W_j$ is a closed path, Lemma 2.22 implies that the templates $T_{u_{i_j},u_{i_{j+1}}}$ for
$j \in [0,k]$ have local decompositions with respect to $R$. Hence, $T_{xy}$ has a local decomposition
with respect to $R$.

Q.E.D. 

Corollary 2.24. Assume that $D(R)$ is strongly connected and let $T \in L_X$ be a permutation
template. Then $T$ has a local decomposition with respect to $R$.

Proof. 

Every  permutation  template  can  be  factored  into  a  product  of  exchange  templates. 
By  Theorem  2.23,  every  exchange  template  has  a  local  decomposition  with  respect  to  R. 

Q.E.D. 
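
The two combinatorial facts used above can be checked on small cases. The following is a minimal sketch, not part of the original development: permutations are modelled as Python lists acting on indices, and the helper names (`compose`, `transposition`, `factor_into_transpositions`, `path_exchange`) are hypothetical. It factors a permutation into exchanges, and it verifies the palindromic chain pattern of the displayed formula for $T_{x,y}$ in the simplest case where consecutive vertices of a path are exchanged.

```python
def compose(p, q):
    """Return the permutation 'apply p, then q' on indices 0..n-1."""
    return [q[p[i]] for i in range(len(p))]


def transposition(n, a, b):
    """Exchange of a and b as a permutation of 0..n-1."""
    t = list(range(n))
    t[a], t[b] = t[b], t[a]
    return t


def compose_all(n, swaps):
    """Compose a list of exchanges (a, b) from left to right."""
    result = list(range(n))
    for a, b in swaps:
        result = compose(result, transposition(n, a, b))
    return result


def factor_into_transpositions(perm):
    """Return exchanges whose left-to-right composition equals perm."""
    cur = list(perm)
    swaps = []
    for i in range(len(cur)):
        while cur[i] != i:
            j = cur[i]
            cur[i], cur[j] = cur[j], cur[i]  # puts the value j into slot j
            swaps.append((i, j))
    return swaps


def path_exchange(n, path):
    """Exchange of path[0] and path[-1] built only from exchanges of
    consecutive path vertices, arranged as a palindromic chain."""
    pairs = [(path[i], path[i + 1]) for i in range(len(path) - 1)]
    chain = pairs + pairs[-2::-1]
    return compose_all(n, chain)


perm = [2, 0, 3, 1, 4]
swaps = factor_into_transpositions(perm)
print(swaps, compose_all(len(perm), swaps) == perm)
print(path_exchange(5, [0, 3, 1, 4]) == transposition(5, 0, 4))
```

The sketch treats each exchange as a primitive; in the setting of Theorem 2.23 each of them would itself be replaced by its local decomposition.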

Thus, strong connectivity of the template configuration is a necessary and sufficient
condition for the existence of local decompositions of all permutation templates. We now
show that this is so in the general case.

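Lemma 2.25. Assume that $|X| = n > 1$, $D(R)$ is strongly connected, and that, for some
$j \in [1,n-1]$, $T_{j1}, T_{j2} \in L_X$ are such that, relative to some ordering of $X$,

$$\Psi(T_{j1}) \equiv M_{j1} = \begin{pmatrix} I_{j-1} & 0 & 0 \\ 0 & \lambda_j & \lambda_{j+1} \; \cdots \; \lambda_n \\ 0 & 0 & I_{n-j} \end{pmatrix}$$

and

$$\Psi(T_{j2}) \equiv M_{j2} = \begin{pmatrix} I_{j-1} & 0 & 0 \\ 0 & \mu_j & 0 \\ 0 & \begin{matrix} \mu_{j+1} \\ \vdots \\ \mu_n \end{matrix} & I_{n-j} \end{pmatrix} .$$

The templates $T_{j1}$ and $T_{j2}$ have local decompositions with respect to $R$.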

Proof. 

Assume that $X = \{ x_1, x_2, \ldots, x_n \}$ is the ordering used to define $\Psi : L_X \to M_n$.
We consider $T_{j1}$ first. Note that if $A \in \mathbf{C}^X$ and $B = A \oplus T_{j1}$, then

$$b(x) = \begin{cases} a(x) & \text{if } x \neq x_j \\ \sum_{k=j}^{n} \lambda_k a(x_k) & \text{if } x = x_j . \end{cases}$$

This can be seen by inspection of the matrix $M_{j1}$ and recalling the definition of $\Psi$ and
$\nu : \mathbf{C}^X \to \mathbf{C}^n$. Since $D(R)$ is strongly connected and $n > 1$, there exists $y \in X$ such that
$y \neq x_j$ and $y \in R(x_j)$, say $y = x_m$. The rough idea of the proof is to first multiply $a(x_j)$
by $\lambda_j$ and leave everything else fixed. Each $a(x_k)$ for $k \in [j+1,n]$ is then "moved" to the
location $x_m$, $\lambda_k a(x_k)$ is added to the current value at the location $x_j$, and $a(x_k)$ is then
moved back to location $x_k$. All these steps can be done locally.

We now define the templates of the local decomposition. Define $T_j \in L_X$ by

$$T_j(x) = \begin{cases} E(x) & \text{if } x \neq x_j \\ \lambda_j E(x) & \text{if } x = x_j . \end{cases}$$

$T_j$ is clearly local and if $B_1 = A \oplus T_j$, then

$$b_1(x) = \begin{cases} a(x) & \text{if } x \neq x_j \\ \lambda_j a(x_j) & \text{if } x = x_j . \end{cases}$$

For each $i,k \in [1,n]$, let $P_{ik}$ denote the exchange template corresponding to $(x_i, x_k)$.
Recall that $P_{ik} = P_{ki} = P_{ik}^{-1}$. For each $k \in [j+1,n]$ define $U_k, T_k \in L_X$ by

$$U_k(x) \equiv \begin{cases} E(x) & \text{if } x \neq x_j \\ E(x_j) + \lambda_k E(x_m) & \text{if } x = x_j \end{cases}$$

and

$$T_k \equiv P_{km} \oplus U_k \oplus P_{mk} .$$

Since $x_m \in R(x_j)$, each $U_k$ is local with respect to $R$. Furthermore, by Theorem 2.23,
each $P_{km}$ has a local decomposition with respect to $R$. Therefore, for every $k \in [j+1,n]$,
$T_k$ has a local decomposition with respect to $R$.

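If $A \in \mathbf{C}^X$ and $B = A \oplus T_k$ for some $k \in [j+1,n]$, then

$$b(x) = \begin{cases} a(x) & \text{if } x \neq x_j \\ a(x_j) + \lambda_k a(x_k) & \text{if } x = x_j . \end{cases}$$

Hence, for every $A \in \mathbf{C}^X$, if $B = A \oplus \left( \bigoplus_{k=j}^{n} T_k \right)$, then

$$b(x) = \begin{cases} a(x) & \text{if } x \neq x_j \\ \sum_{k=j}^{n} \lambda_k a(x_k) & \text{if } x = x_j . \end{cases}$$

Thus, $T_{j1} = \bigoplus_{k=j}^{n} T_k$, which enables us to conclude that $T_{j1}$ has a local decomposition with
respect to $R$.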

We now prove the lemma for $T_{j2}$. The concept is the same; there are some points
which must be modified to account for the transposition. Note that if $A \in \mathbf{C}^X$ and
$B = A \oplus T_{j2}$, then

$$b(x) = \begin{cases} a(x) & \text{if } x = x_k \text{ for } k \in [1,j-1] \\ \mu_j a(x_j) & \text{if } x = x_j \\ \mu_k a(x_j) + a(x_k) & \text{if } x = x_k \text{ for } k \in [j+1,n] . \end{cases}$$

Define the local template $S_{n+1} \in L_X$ by

$$S_{n+1}(x) = \begin{cases} E(x) & \text{if } x \neq x_j \\ \mu_j E(x_j) & \text{if } x = x_j . \end{cases}$$

Since $D(R)$ is strongly connected, for every $k \in [j+1,n]$ there exists an $m_k \in [1,n]$ such
that $x_{m_k} \in R(x_k)$ and $x_{m_k} \neq x_k$. For every such $k$ define $V_k, S_k \in L_X$ by

$$V_k(x) = \begin{cases} E(x) & \text{if } x \neq x_k \\ E(x_k) + \mu_k E(x_{m_k}) & \text{if } x = x_k \end{cases}$$

and

$$S_k = P_{j m_k} \oplus V_k \oplus P_{j m_k}$$

where $P_{j m_k}$ denotes an exchange template. Each $S_k$ for $k \in [j+1,n]$ has a local
decomposition since each $V_k$ is local and the other factors are exchange templates. Moreover, if
$A \in \mathbf{C}^X$ and $B = A \oplus S_k$ for some $k \in [j+1,n]$, then

$$b(x) = \begin{cases} a(x) & \text{if } x \neq x_k \\ a(x_k) + \mu_k a(x_j) & \text{if } x = x_k . \end{cases}$$

Thus, for every $A \in \mathbf{C}^X$, $A \oplus T_{j2} = A \oplus \left( \bigoplus_{k=j}^{n} S_{k+1} \right)$, so $T_{j2} = \bigoplus_{k=j}^{n} S_{k+1}$, which enables us to
conclude that $T_{j2}$ has a local decomposition with respect to $R$.

Q.E.D. 
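
The "move, add, move back" idea of the first half of the proof can be simulated directly. The sketch below is only an illustration under stated assumptions: an image is a Python list indexed from 0, `j` is the index of $x_j$, `m` is the index of a neighbour $x_m$ of $x_j$, and `lam` maps each index $k \geq j$ to $\lambda_k$. The function name `apply_Tj1_locally` is hypothetical, and each exchange is treated as a primitive step even though, by Theorem 2.23, it would itself be realized by a sequence of local templates.

```python
def exchange(values, p, q):
    """Exchange template P_pq: swap the values stored at locations p and q."""
    values[p], values[q] = values[q], values[p]


def apply_Tj1_locally(a, j, m, lam):
    """Apply T_j followed by T_k = P_km (+) U_k (+) P_mk for each k > j."""
    assert m != j, "x_m must be a neighbour of x_j different from x_j"
    b = list(a)
    b[j] = lam[j] * b[j]              # T_j: scale the value at x_j
    for k in range(j + 1, len(b)):
        exchange(b, k, m)             # P_km: move a(x_k) next to x_j
        b[j] += lam[k] * b[m]         # U_k: x_j reads only its neighbour x_m
        exchange(b, k, m)             # P_mk: move a(x_k) back
    return b


def apply_Tj1_directly(a, j, lam):
    """Reference result read off from the matrix M_j1."""
    b = list(a)
    b[j] = sum(lam[k] * a[k] for k in range(j, len(a)))
    return b


a = [1, 2, 3, 4, 5]
lam = {1: 2, 2: 3, 3: -1, 4: 4}
print(apply_Tj1_locally(a, 1, 0, lam), apply_Tj1_directly(a, 1, lam))
```

The only non-exchange steps read a value from `m`, the assumed neighbour of `j`, which is what makes each $U_k$ local.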


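Lemma 2.26. Let $M \in M_n$ with $n > 1$. For every $j \in [1,n-1]$ there exist $n \times n$ permutation
matrices $P_j$ and $Q_j$, $n \times n$ matrices $M_{j1}$ and $M_{j2}$ of the form given in Lemma 2.25,
and a constant $c \in \mathbf{C}$ such that

$$M = \left( \prod_{j=1}^{n-1} P_j M_{j1} \right) \begin{pmatrix} I_{n-1} & 0 \\ 0 & c \end{pmatrix} \left( \prod_{j=n-1}^{1} M_{j2} Q_j \right) .$$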

Proof. 

We make use of an observation made by Tchuente that there exist permutation
matrices $P$ and $Q$, constants $\lambda_1, \lambda_2, \ldots, \lambda_n$, $\mu_1, \mu_2, \ldots, \mu_n$, and an $(n-1) \times (n-1)$ matrix
$C$ such that

$$M = P \begin{pmatrix} \lambda_1 & \lambda_2 \; \cdots \; \lambda_n \\ 0 & I_{n-1} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & C \end{pmatrix} \begin{pmatrix} \mu_1 & 0 \\ \begin{matrix} \mu_2 \\ \vdots \\ \mu_n \end{matrix} & I_{n-1} \end{pmatrix} Q .$$

The proof is by induction on $n$. If $n = 2$, then the statement of the lemma is
identical to the observation made by Tchuente. Assume that $n \geq 3$ and that the lemma is
true for $2, 3, \ldots, n-1$. By Tchuente's observation there exist permutation matrices $P_1$
and $Q_1$, $n \times n$ matrices $M_{11}$ and $M_{12}$ of the form given in Lemma 2.25, and an $(n-1) \times (n-1)$
matrix $C$ such that

$$M = P_1 M_{11} \begin{pmatrix} 1 & 0 \\ 0 & C \end{pmatrix} M_{12} Q_1 .$$

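By the induction hypothesis applied to $C$, for every $j \in [1,n-2]$ there exist $(n-1) \times (n-1)$
permutation matrices $P'_j$, $Q'_j$, $(n-1) \times (n-1)$ matrices $M'_{j1}$, $M'_{j2}$ of the form given in
Lemma 2.25, and a constant $c \in \mathbf{C}$ such that

$$C = \left( \prod_{j=1}^{n-2} P'_j M'_{j1} \right) \begin{pmatrix} I_{n-2} & 0 \\ 0 & c \end{pmatrix} \left( \prod_{j=n-2}^{1} M'_{j2} Q'_j \right) .$$

For $j \in [1,n-2]$ define

$$P_{j+1} = \begin{pmatrix} 1 & 0 \\ 0 & P'_j \end{pmatrix}$$

and define $M_{j+1,1}$, $M_{j+1,2}$, and $Q_{j+1}$ in a similar fashion. Since multiplication of
block diagonal matrices can be accomplished by multiplying corresponding blocks,

$$\begin{pmatrix} 1 & 0 \\ 0 & C \end{pmatrix} = \left( \prod_{j=2}^{n-1} P_j M_{j1} \right) \begin{pmatrix} I_{n-1} & 0 \\ 0 & c \end{pmatrix} \left( \prod_{j=n-1}^{2} M_{j2} Q_j \right) .$$

Hence,

$$M = P_1 M_{11} \begin{pmatrix} 1 & 0 \\ 0 & C \end{pmatrix} M_{12} Q_1 = \left( \prod_{j=1}^{n-1} P_j M_{j1} \right) \begin{pmatrix} I_{n-1} & 0 \\ 0 & c \end{pmatrix} \left( \prod_{j=n-1}^{1} M_{j2} Q_j \right) ,$$

which proves the lemma. Here a product whose index decreases is taken with its factors
in that decreasing order from left to right.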
Q.E.D. 


We  now  show  that  strong  connectivity  is  sufficient  in  general. 

Theorem 2.27. If $D(R)$ is strongly connected, then every template $T \in L_X$ has a local
decomposition with respect to $R$.

Proof. 


Let $T \in L_X$. We can write

$$\Psi(T) = \left( \prod_{j=1}^{n-1} P_j M_{j1} \right) \begin{pmatrix} I_{n-1} & 0 \\ 0 & c \end{pmatrix} \left( \prod_{j=n-1}^{1} M_{j2} Q_j \right)$$

where the $P_j$ and $Q_j$ are permutation matrices and the $M_{ji}$ are $n \times n$ matrices of the form
given in Lemma 2.25. For every $j \in [1,n-1]$ and $i \in [1,2]$, let $T_{ji} \equiv \Psi^{-1}(M_{ji})$,
$U_j \equiv \Psi^{-1}(P_j)$, $V_j \equiv \Psi^{-1}(Q_j)$, and

$$S \equiv \Psi^{-1} \begin{pmatrix} I_{n-1} & 0 \\ 0 & c \end{pmatrix} .$$

By Lemma 2.26,

$$T = \left( \bigoplus_{j=n-1}^{1} V_j \oplus T_{j2} \right) \oplus S \oplus \left( \bigoplus_{j=n-1}^{1} T_{j1} \oplus U_j \right) .$$

Since the $V_j$ and $U_j$ are permutation templates, by Corollary 2.24, they have local
decompositions. By Lemma 2.25, for every $j \in [1,n-1]$ and $i \in [1,2]$, $T_{ji}$ has a local
decomposition. The template $S$ is local since $\Psi(S)$ is diagonal. Hence, $T$ has a local
decomposition with respect to $R$.

Q.E.D. 

This  concludes  the  sequence  of  theorems  used  to  prove  Theorem  2.16. 

We have shown how correspondences between directed graphs, template
configurations, networks of processors, and matrices can be used to provide necessary
and sufficient conditions for the existence of local decompositions of linear transforms.
We did not address the problem of developing efficient algorithms for obtaining such
decompositions in this chapter. We remark in closing that factoring a linear transformation
into a product of local linear transformations is not the only conceivable method of
implementing one locally. As we shall see, it is a good method for obtaining local
algorithms in some cases.


CHAPTER  3 

TRANSLATION  INVARIANT  AND  CIRCULANT  TEMPLATES 

AND  GENERALIZATIONS 


In Chapter 3 we define and generalize translation invariant and circulant templates.
These templates occur often in digital image processing. The chapter is divided into
three sections. The first section consists mainly of basic definitions and properties, and
descriptions of the matrices corresponding to translation invariant and circulant
templates. As we mentioned in the introduction to Part I, these templates are used to
implement convolutions and therefore are directly related to the discrete Fourier
transform. We describe how these relationships are manifested in the setting of the
image algebra. In the second section, we describe a family of relationships between
circulant templates and polynomials. We show how the problem of finding local
decompositions of circulant templates is equivalent to that of factoring multivariable polynomials.
In the case of separable circulants, the problem reduces to the single variable case. In
the third section, we generalize the notion of a circulant template by defining a class of
templates which we call G-templates. G-templates are translation invariant with respect
to Cayley networks, which are networks whose underlying graph is the group graph of
some finite group G. Cayley networks are being studied as possible models for parallel
computer architectures [53]. Thus G-templates have potential applications to parallel
image processing. We show that the set of all G-templates is an algebra which is
isomorphic to the group algebra of G over the complex numbers. We also show that the
invertibility of G-templates is directly linked to the invertibility of the discrete Radon
transform on finite groups [11].
3.1. Basic Definitions and Relationships

In this section we define translation invariant and circulant templates and describe
some of their basic properties. We remark that many of the results of this section are
essentially known in one form or another. They are new results in the sense that they
are theorems about objects in the image algebra. One goal in developing an algebraic
structure for digital image processing is to understand the relationship of the operations
used to the existing theory. Therefore, it is useful to assemble these facts within the
setting of the image algebra. The inclusion of these results also helps to keep this
dissertation complete and accessible to researchers who may not be familiar with some of the
existing theory.

In  this  chapter,  we  shall  consider  matrices  and  vectors  to  be  indexed  by  sets  of  the 
form  {  0,1  ,  .  .  .  ,   n-1  }. 

Definition 3.1. Let $X$ be a finite subset of $\mathbf{Z}^k$. We say that $T \in L_X$ is translation
invariant if and only if for every translation $\phi : \mathbf{Z}^k \to \mathbf{Z}^k$, if $\phi(x), \phi(y) \in X$, then the equation
$t_x(y) = t_{\phi(x)}(\phi(y))$ holds.

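Theorem 3.2. If $T \in L_X$ is translation invariant, then the configuration function satisfies

$$(\mathcal{T}(x) + z) \cap X = \mathcal{T}(x+z) \cap (X + z)$$

for every $x, z \in \mathbf{Z}^k$ such that $x, x+z \in X$.

Proof.

Assume that $x, z \in \mathbf{Z}^k$ are such that $x, x+z \in X$.

Let $y \in (\mathcal{T}(x) + z) \cap X$. Then $y = \bar{y} + z$ for some $\bar{y} \in \mathcal{T}(x)$. Let $\phi : \mathbf{Z}^k \to \mathbf{Z}^k$ be
defined by $\phi(u) = u + z$. Then $\phi(x), \phi(\bar{y}) \in X$. Hence $t_{x+z}(\bar{y}+z) = t_{\phi(x)}(\phi(\bar{y})) = t_x(\bar{y}) \neq 0$,
which implies that $y = \bar{y} + z \in \mathcal{T}(x+z)$. Since $\bar{y} \in X$ it follows that
$y \in \mathcal{T}(x+z) \cap (X + z)$.

Conversely, assume that $y \in \mathcal{T}(x+z) \cap (X + z)$. Since $y \in (X + z)$, there exists a
$\bar{y} \in X$ such that $y = \bar{y} + z$. Since $y \in \mathcal{T}(x+z)$, $t_{x+z}(\bar{y}+z) \neq 0$. Let $\phi : \mathbf{Z}^k \to \mathbf{Z}^k$ be
defined by $\phi(u) \equiv u - z$. Then $t_{x+z}(\bar{y}+z) = t_{\phi(x+z)}(\phi(\bar{y}+z)) = t_x(\bar{y}) \neq 0$, which implies that
$\bar{y} \in \mathcal{T}(x)$, so $y \in (\mathcal{T}(x) + z)$. By definition $\mathcal{T}(x+z) \subset X$, so we may conclude that
$y \in (\mathcal{T}(x) + z) \cap X$.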

Q.E.D. 

Definition 3.3. Let $A = (a_{ij})$ be an $n \times n$ matrix. We say that $A$ is a Toeplitz matrix if
and only if for every $i, j \in [0,n-1]$ and $k \in \mathbf{Z}$ such that $i+k, j+k \in [0,n-1]$, $a_{ij} = a_{i+k,j+k}$.
A Toeplitz matrix is constant along the diagonals.


Definition 3.4. Let $A = (a_{ij})$ be an $mn \times mn$ matrix. We say that $A$ is block Toeplitz
with Toeplitz blocks if and only if

$$A = \begin{pmatrix} A_0 & A_1 & \cdots & A_{m-1} \\ A_{-1} & A_0 & \cdots & A_{m-2} \\ \vdots & & \ddots & \vdots \\ A_{-(m-1)} & A_{-(m-2)} & \cdots & A_0 \end{pmatrix}$$

where for every $i \in [-(m-1), m-1]$, $A_i$ is a Toeplitz matrix.

For the rest of this section, let $X = \{ (i,j) : 0 \leq i \leq m-1, \; 0 \leq j \leq n-1 \}$ be an $m \times n$
array. The following results can also be formulated in a straightforward fashion for
higher dimensional arrays. As we remarked at the end of Chapter 1, we assume that the
matrix corresponding to a template is constructed using the row major lexicographic
ordering.
Theorem 3.5. Let $T \in L_X$ be translation invariant and let $M_T = (\mu_{ij})$ be the matrix
corresponding to $T$. Then $M_T$ is block Toeplitz with Toeplitz blocks.


Proof.


Since $M_T$ is an $mn \times mn$ matrix, we can write

$$M_T = \begin{pmatrix} A_{00} & A_{01} & \cdots & A_{0,m-1} \\ A_{10} & A_{11} & \cdots & A_{1,m-1} \\ \vdots & & & \vdots \\ A_{m-1,0} & A_{m-1,1} & \cdots & A_{m-1,m-1} \end{pmatrix}$$

where each $A_{ij}$ is an $n \times n$ matrix. We first show that if $i, j \in [0,m-1]$, then $A_{ij}$ is a
Toeplitz matrix. Let $A_{ij} = (a_{st})$ where $s, t \in [0,n-1]$. Then $a_{st} = \mu_{in+s,\,jn+t} = t_{(i,s)}(j,t)$. Fix $s$
and $t$ and assume that $k \in \mathbf{Z}$ is such that $s+k, t+k \in [0,n-1]$. Then
$a_{s+k,t+k} = \mu_{in+s+k,\,jn+t+k} = t_{(i,s+k)}(j,t+k) = t_{(i,s)}(j,t) = a_{st}$, which implies that $A_{ij}$ is Toeplitz.

We now show that $M_T$ is block Toeplitz; that is, if $k \in \mathbf{Z}$ is such that $i+k, j+k \in [0,m-1]$,
then $A_{i+k,j+k} = A_{ij}$. Let $A_{i+k,j+k} = (b_{st})$ and fix $s, t \in [0,n-1]$. Note that
$a_{st} = \mu_{in+s,\,jn+t} = t_{(i,s)}(j,t)$ and $b_{st} = \mu_{(i+k)n+s,\,(j+k)n+t} = t_{(i+k,s)}(j+k,t)$. Since $i+k, j+k \in [0,m-1]$
and $T$ is translation invariant, we conclude that $a_{st} = b_{st}$.


Q.E.D. 
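
Theorem 3.5 can be checked numerically on a small array. The sketch below is an illustration only: it assumes a translation invariant template is given by a dictionary `w` of weights indexed by offset, so that $t_{(i,s)}(j,t) = w(j-i,\,t-s)$, and it builds the matrix with the row major convention $\mu_{in+s,\,jn+t} = t_{(i,s)}(j,t)$ used in the proof. The function names are hypothetical.

```python
def template_matrix(m, n, w):
    """Matrix of the translation invariant template with offset weights w
    on an m x n array, using row-major ordering of the array points."""
    size = m * n
    M = [[0] * size for _ in range(size)]
    for i in range(m):
        for s in range(n):
            for j in range(m):
                for t in range(n):
                    M[i * n + s][j * n + t] = w.get((j - i, t - s), 0)
    return M


def block(M, n, i, j):
    """The n x n block A_ij of M."""
    return [row[j * n:(j + 1) * n] for row in M[i * n:(i + 1) * n]]


def is_toeplitz(A):
    """True if A is constant along each diagonal."""
    n = len(A)
    return all(A[s][t] == A[s + 1][t + 1]
               for s in range(n - 1) for t in range(n - 1))


def is_block_toeplitz_with_toeplitz_blocks(M, m, n):
    blocks_ok = all(is_toeplitz(block(M, n, i, j))
                    for i in range(m) for j in range(m))
    shifts_ok = all(block(M, n, i, j) == block(M, n, i + 1, j + 1)
                    for i in range(m - 1) for j in range(m - 1))
    return blocks_ok and shifts_ok


# A 3 x 3 neighbourhood of weights, indexed by offset from the target point.
w = {(-1, -1): 1, (-1, 0): 2, (-1, 1): 3,
     (0, -1): 2, (0, 0): 4, (0, 1): 6,
     (1, -1): 3, (1, 0): 6, (1, 1): 9}
M = template_matrix(3, 4, w)
print(is_block_toeplitz_with_toeplitz_blocks(M, 3, 4))
```

Perturbing a single entry of `M` makes the check fail, which is a quick way to see that the structure is not automatic for arbitrary templates.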


Corollary 3.6. The inverse of a translation invariant template is not necessarily
translation invariant.


Proof.

The inverse of a block Toeplitz matrix with Toeplitz blocks is not necessarily
Toeplitz [6, 24].

Q.E.D.

We  now  define  circulant  templates. 

Remark 3.7. If $x = (x_1, x_2) \in \mathbf{Z}^2$ then we denote $(x_1 \,(\mathrm{mod}\ m),\; x_2 \,(\mathrm{mod}\ n))$ by
$x \,(\mathrm{mod}\ (m,n))$.

Definition 3.8. We say that $\phi : X \to X$ is a circulant translation if and only if $\phi$ is of the
form $\phi(x) = (x + h)(\mathrm{mod}\ (m,n))$ for some $h \in X$.

Note that if $z \in \mathbf{Z} \times \mathbf{Z}$ and $z \notin X$, then $z \,(\mathrm{mod}\ (m,n)) = y \in X$ and
$(x+z)(\mathrm{mod}\ (m,n)) = (x+y)(\mathrm{mod}\ (m,n))$. Therefore, we shall sometimes define circulants
in terms of elements $z \notin X$.

Theorem 3.9. Let $\Phi = \{ \phi : \phi \text{ is a circulant translation on } X \}$. The set $\Phi$ equipped with
the operation of composition is a group which is isomorphic to $\mathbf{Z}_m \times \mathbf{Z}_n$.

Proof.

Note that $X = \mathbf{Z}_m \times \mathbf{Z}_n$ as a set and can obviously be made into a group
isomorphic to $\mathbf{Z}_m \times \mathbf{Z}_n$ using addition mod $(m,n)$. For each $h \in X$ denote by $\phi_h$ the
circulant translation defined by $\phi_h(x) = (x + h)(\mathrm{mod}\ (m,n))$. Note that $\phi \in \Phi$ if and only if
$\phi = \phi_h$ for some $h \in X$. Define $\psi : X \to \Phi$ by $\psi(h) = \phi_h$. Then $\psi$ is an isomorphism.

Q.E.D. 

Definition 3.10. We say that $T \in L_X$ is circulant if and only if for every circulant
translation $\phi$, the equation $t_x(y) = t_{\phi(x)}(\phi(y))$ holds. We denote the set of all circulants on $X$
by $\mathcal{C}_X$.

Remark 3.11. A circulant template is well defined by defining it at one point. That is, if
$T \in \mathcal{C}_X$ and $T(i,j) = A$, then $T(x,y) = \{ (u, v, a(\phi^{-1}(u,v))) \}$ where $\phi$ is the circulant
translation with the property that $\phi(i,j) = (x,y)$.

Theorem 3.12. If $T \in \mathcal{C}_X$ is circulant, then $T$ is translation invariant.

Proof.

Let $x, y \in X$, $h \in \mathbf{Z}^2$, and let $\phi : \mathbf{Z}^2 \to \mathbf{Z}^2$ be the translation of the plane by $h$. Assume
that $\phi(x), \phi(y) \in X$. Then $(x+h) = (x+h)(\mathrm{mod}\ (m,n))$ and $(y+h) = (y+h)(\mathrm{mod}\ (m,n))$.
Hence $t_{\phi(x)}(\phi(y)) = t_{(x+h)(\mathrm{mod}\ (m,n))}((y+h)(\mathrm{mod}\ (m,n))) = t_x(y)$.

Q.E.D. 


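Definition 3.13. Let $C = (c_{ij})$ be an $n \times n$ matrix. We say that $C$ is a circulant matrix of
order $n$ if and only if for every $k \in \mathbf{Z}$, $c_{ij} = c_{(i+k)(\mathrm{mod}\ n),\,(j+k)(\mathrm{mod}\ n)}$. Thus

$$C = \begin{pmatrix} c_0 & c_1 & \cdots & c_{n-1} \\ c_{n-1} & c_0 & \cdots & c_{n-2} \\ \vdots & & \ddots & \vdots \\ c_1 & c_2 & \cdots & c_0 \end{pmatrix} .$$

We write $C = \mathrm{circ}(c_0, c_1, \ldots, c_{n-1})$.

Definition 3.14. Let $B$ be an $mn \times mn$ matrix. We say that $B$ is a block circulant matrix
with circulant blocks of type $(m,n)$ if and only if there exist $n \times n$ circulant matrices
$C_0, C_1, \ldots, C_{m-1}$ such that $B = \mathrm{circ}(C_0, C_1, \ldots, C_{m-1})$.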

The  next  theorem  shows  that  circulant  templates  deserve  their  name. 

Theorem 3.15. Let $T \in \mathcal{C}_X$ be a circulant template. The matrix $M_T$ is a block circulant
matrix with circulant blocks.


Proof.

Denote $M_T = (\mu_{ij})$. Since $T$ is translation invariant it follows that $M_T$ is block
Toeplitz with Toeplitz blocks. Let

$$M_T = \begin{pmatrix} A_0 & A_1 & \cdots & A_{m-1} \\ A_{-1} & A_0 & \cdots & A_{m-2} \\ \vdots & & \ddots & \vdots \\ A_{-(m-1)} & A_{-(m-2)} & \cdots & A_0 \end{pmatrix} .$$

Let $h \in [0,m-1]$. We show that $A_h$ is a circulant matrix. Denote $A_h = (a_{ij})$. Then
$a_{ij} = \mu_{i,\,hn+j} = t_{(0,i)}(h,j)$. Let $k \in \mathbf{Z}$. Then

$$a_{(i+k)(\mathrm{mod}\ n),\,(j+k)(\mathrm{mod}\ n)} = \mu_{(i+k)(\mathrm{mod}\ n),\; hn+((j+k)(\mathrm{mod}\ n))} = t_{(0,\,(i+k)(\mathrm{mod}\ n))}(h,\,(j+k)(\mathrm{mod}\ n)) .$$

Define the circulant translation $\phi : X \to X$ by $\phi(u,v) = (u, (v+k)(\mathrm{mod}\ n))$. Then

$$a_{ij} = t_{(0,i)}(h,j) = t_{\phi(0,i)}(\phi(h,j)) = t_{(0,\,(i+k)(\mathrm{mod}\ n))}(h,\,(j+k)(\mathrm{mod}\ n)) = a_{(i+k)(\mathrm{mod}\ n),\,(j+k)(\mathrm{mod}\ n)} .$$

Thus we may conclude that $A_h$ is a circulant matrix.

We now show that $M_T$ is block circulant, that is, if $h \equiv k \,(\mathrm{mod}\ m)$, then $A_h = A_k$.
There are two possibilities: $h = k$, which is trivial, and (say) $h > 0$ and $k < 0$. In the
latter case we have $h - k = m$. Let $A_h = (a_{ij})$ and $A_k = (b_{ij})$. Then $a_{ij} = \mu_{i,\,hn+j} = t_{(0,i)}(h,j)$
and $b_{ij} = \mu_{-kn+i,\,j} = t_{(-k,i)}(0,j)$. Define the circulant translation $\phi : X \to X$ by
$\phi(u,v) \equiv ((u+h)(\mathrm{mod}\ m), v)$. Then $b_{ij} = t_{(-k,i)}(0,j) = t_{\phi(-k,i)}(\phi(0,j)) = t_{(0,i)}(h,j) = a_{ij}$,
which shows that $A_h = A_k$ and proves the theorem.

Q.E.D. 

Corollary 3.16. $(\mathcal{C}_X, \oplus, +)$ is a commutative ring with unity.

Proof.

This follows immediately from the facts that the set of all block circulant matrices
with circulant blocks of type $(m,n)$ forms a commutative ring with unity [10] and that
the mapping $\Psi : L_X \to M_{mn}$ is an isomorphism.


Q.E.D. 


As mentioned previously, translation invariant and circulant templates are used to
implement convolutions in the image algebra. In fact, if $T \in \mathcal{C}_X$, then the mapping
$A \to A \oplus T$ is the circular convolution of the two-dimensional sequences $a_{ij} = a(i,j)$ and
$b_{ij} = t_{(0,0)}((-i,-j)(\mathrm{mod}\ (m,n)))$. Translation invariant templates occur often in digital
image processing. The circulant templates are closely related to the discrete Fourier
transform and the theory of fast convolutions. There are various techniques available for
approximating translation invariant computations by circular convolutions [1, 37]. The
simplest techniques, which are adequate for small convolutions, are either to pad the
array with zeroes or to ignore the boundary effects.
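
The zero padding technique can be illustrated in one dimension. The following sketch is not taken from the cited references; it assumes plain Python lists as sequences and hypothetical function names. It shows that a circular convolution of sequences padded to length `len(a) + len(b) - 1` reproduces the ordinary (non-circular) convolution, while an insufficiently padded circular convolution wraps values around the boundary.

```python
def circular_convolution(a, b):
    """Circular convolution of two sequences of the same length."""
    n = len(a)
    return [sum(a[i] * b[(k - i) % n] for i in range(n)) for k in range(n)]


def linear_convolution(a, b):
    """Ordinary convolution with no wrap-around."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def padded_circular_convolution(a, b):
    """Pad both sequences with zeroes so that wrap-around cannot occur."""
    size = len(a) + len(b) - 1
    pa = a + [0] * (size - len(a))
    pb = b + [0] * (size - len(b))
    return circular_convolution(pa, pb)


a = [1, 2, 3, 4]
b = [1, 1, 1]
print(linear_convolution(a, b))
print(padded_circular_convolution(a, b))
print(circular_convolution(a, b + [0]))  # too little padding: values wrap
```

The same idea applies along each axis of a two-dimensional array.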

The  definitions  and  theorems  presented  in  this  section  help  to  provide  a  link 
between  image  algebra  and  matrix  algebra  associated  with  the  discrete  Fourier 
transform.  We  now  describe  this  link. 

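Definition 3.17. Let $A$ be an $m \times n$ matrix and $B$ an $s \times t$ matrix both with entries in $F$.
The Kronecker, or tensor, product of $A$ and $B$, denoted $A \otimes B$, is the $ms \times nt$ matrix
given by

$$A \otimes B = \begin{pmatrix} a_{00}B & \cdots & a_{0,n-1}B \\ \vdots & & \vdots \\ a_{m-1,0}B & \cdots & a_{m-1,n-1}B \end{pmatrix} .$$

It is well known that $(A \otimes B)(C \otimes D) = AC \otimes BD$ provided that the matrices are all
of the appropriate dimensions. Hence,

$$\prod_{i=0}^{k} (A_i \otimes B_i) = \left( \prod_{i=0}^{k} A_i \right) \otimes \left( \prod_{i=0}^{k} B_i \right) .$$

Let $i = \sqrt{-1}$.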


Definition 3.18. Define $K \in L_X$ by

$$K(u,v) = \frac{1}{\sqrt{nm}} \exp[-2\pi i (uX/n + vY/m)]$$

where $X, Y \in F^X$ are defined by $X = \{ (x,y,z) : x = z \}$ and $Y = \{ (x,y,z) : y = z \}$. $K$
is called the two-dimensional DFT template. The transform $A \to A \oplus K$ is the DFT of
$A$. $K$ is an invertible template with inverse given by

$$K^{-1}(u,v) = \frac{1}{\sqrt{nm}} \exp[2\pi i (uX/n + vY/m)] .$$

Let $\omega_n = \exp[-2\pi i/n]$.

Definition 3.19. Denote by $F_n$ the $n \times n$ matrix with $ij$-th entry equal to $n^{-1/2}\omega_n^{ij}$ for
$i, j \in [0,n-1]$, that is, $F_n = n^{-1/2}(\omega_n^{ij})$. $F_n$ is called the one-dimensional Fourier matrix of order
$n$. If $x \in \mathbf{C}^n$ then $F_n x$ is the one-dimensional discrete Fourier transform (1-dim DFT) of
$x$.

Definition 3.20. Let $X$ be an $m \times n$ array and let $K$ be the Fourier template. Let
$F_{m \times n} = \Psi(K)$. The matrix $F_{m \times n}$ is called the two-dimensional Fourier matrix of order
$m \times n$.

Note that $F_{m \times n} \neq F_{mn}$ in general. Note further that
$F_{m \times n} = F_m \otimes F_n = (F_m \otimes I_n)(I_m \otimes F_n)$. This last observation is an equation which
expresses the fact that the two-dimensional DFT can be computed by first computing
one-dimensional DFTs along each row and then along each column of the result.
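
The factorization $F_m \otimes F_n = (F_m \otimes I_n)(I_m \otimes F_n)$ can be verified numerically for small sizes. The sketch below is only a check under stated assumptions: matrices are nested Python lists, the helper names (`fourier_matrix`, `kron`, `matmul`, `close`) are hypothetical, and equality is tested up to a floating-point tolerance.

```python
import cmath


def fourier_matrix(n):
    """F_n with entries n^(-1/2) * w^(i*j), where w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (i * j) / n ** 0.5 for j in range(n)] for i in range(n)]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def kron(A, B):
    """Kronecker product: the block in position (i, j) is A[i][j] * B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def close(A, B, tol=1e-9):
    return all(abs(x - y) < tol
               for row_a, row_b in zip(A, B) for x, y in zip(row_a, row_b))


m, n = 3, 4
Fm, Fn = fourier_matrix(m), fourier_matrix(n)
two_dim = kron(Fm, Fn)
factored = matmul(kron(Fm, identity(n)), kron(identity(m), Fn))
print(close(two_dim, factored))
print(close(two_dim, fourier_matrix(m * n)))  # differs for m = 3, n = 4
```

Each factor acts along one axis of the array only, which is the matrix form of the row-column method.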

Let $P = \mathrm{circ}(0,1,0,\ldots,0)$ be $n \times n$. If $C = \mathrm{circ}(c_0, c_1, \ldots, c_{n-1})$, then let
$f_C(x) = c_0 + c_1 x + \cdots + c_{n-1}x^{n-1}$. Then $C = f_C(P)$. Let $\Omega = \mathrm{diag}(\omega_n^i)$, $i = 0, 1, \ldots, n-1$. Then, using the
fact that

$$\sum_{i=0}^{n-1} \omega_n^{ik} = \begin{cases} n & \text{if } k \equiv 0 \ (\mathrm{mod}\ n) \\ 0 & \text{else,} \end{cases}$$

it can be seen that $P = F_n \Omega F_n^*$ where $*$ denotes the conjugate transpose. It follows that
if $C$ is circulant, then $C = F_n \Lambda F_n^*$ where $\Lambda = \mathrm{diag}(f_C(\omega_n^i))$, $i = 0, 1, \ldots, n-1$. This is the
matrix formulation of the circular convolution theorem. Similarly, if $B$ is a block
circulant matrix with circulant blocks of type $(m,n)$, then $B = (F_m \otimes F_n) D (F_m \otimes F_n)^*$ where $D$
is a diagonal matrix.
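
The identity $C = F_n \Lambda F_n^*$ with $\Lambda = \mathrm{diag}(f_C(\omega_n^i))$ can be checked for a small circulant. The following is a minimal numerical sketch using only the standard library; the helper names are hypothetical and the comparison uses a floating-point tolerance.

```python
import cmath


def circ(c):
    """Circulant matrix circ(c_0, ..., c_{n-1}); row i is row 0 shifted right i times."""
    n = len(c)
    return [[c[(j - i) % n] for j in range(n)] for i in range(n)]


def fourier_matrix(n):
    """F_n with entries n^(-1/2) * w^(i*j), where w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (i * j) / n ** 0.5 for j in range(n)] for i in range(n)]


def conjugate_transpose(A):
    return [[A[j][i].conjugate() for j in range(len(A))]
            for i in range(len(A[0]))]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def from_spectrum(c):
    """Rebuild circ(c) as F * diag(f_C(w^i)) * F^*."""
    n = len(c)
    w = cmath.exp(-2j * cmath.pi / n)
    eigenvalues = [sum(c[r] * w ** (r * i) for r in range(n)) for i in range(n)]
    Lam = [[eigenvalues[i] if i == j else 0 for j in range(n)] for i in range(n)]
    F = fourier_matrix(n)
    return matmul(matmul(F, Lam), conjugate_transpose(F))


c = [1, 2, 3, 4, 5]
C = circ(c)
R = from_spectrum(c)
print(max(abs(C[i][j] - R[i][j]) for i in range(5) for j in range(5)) < 1e-9)
```

Because every circulant of order $n$ is rebuilt from the same $F_n$, only the diagonal factor changes from one circulant to another.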

Thus, the circulant matrices are all simultaneously diagonalizable by the Fourier
matrices. In the language of image algebra, we can say that if $T \in \mathcal{C}_X$, then there exists
a template $S \in L_X$ such that $\mathcal{S}(x) \subset \{x\}$ for every $x \in X$ and $A \oplus T = ((A \oplus K) \oplus S) \oplus K^{-1}$
for every image $A$. Note that the computation $B \oplus S$ can be implemented using pointwise
multiplication. This is how the fast Fourier transform (FFT) can be used to facilitate the
computation of convolutions.

The Fourier template can be seen as inducing an isomorphism of $\mathcal{C}_X$ onto $\mathbf{C}^X$. This
is the image algebra version of the convolution theorem.

Theorem 3.21. The mapping $\Phi : \mathcal{C}_X \to \mathbf{C}^X$ defined by $\Phi(S) = S(0,0) \oplus K$ is a ring
isomorphism of $\mathcal{C}_X$ onto $\mathbf{C}^X$.

Proof.

If $R, S \in \mathcal{C}_X$ then, since the mapping $A \to A \oplus K$ is linear and addition of templates
is defined pointwise, $\Phi(R) + \Phi(S) = \Phi(R+S)$. By definition of the discrete Fourier
transform, $\Phi(E) = E(0,0) \oplus K = I$. To see that $\Phi$ is onto, let $B \in \mathbf{C}^X$, let $A = B \oplus K^{-1}$,
and define $S \in \mathcal{C}_X$ by taking $S(0,0) = A$ and extending by circularity. Then $\Phi(S) = B$.
Suppose that $\Phi(R) = \Phi(S)$. Then $R(0,0) \oplus K = S(0,0) \oplus K$, which implies that
$R(0,0) = S(0,0)$ since $K$ is invertible. Since $R$ and $S$ are determined by their values at one
point, we may conclude that $R = S$ and therefore that $\Phi$ is 1-1.

Let $T = R \oplus S$ and, for every $(i,j) \in X$, define $\phi_{ij} : X \to X$ by
$\phi_{ij}(x,y) = (x-i \ (\mathrm{mod}\ m),\; y-j \ (\mathrm{mod}\ n))$. Then $T(0,0) = \{ (x, y, t_{(0,0)}(x,y)) \}$ where

$$t_{(0,0)}(x,y) = \sum_{(i,j) \in X} s_{(0,0)}(i,j)\, r_{(i,j)}(x,y)$$

$$= \sum_{(i,j) \in X} s_{(0,0)}(i,j)\, r_{\phi_{ij}(i,j)}(\phi_{ij}(x,y))$$

$$= \sum_{(i,j) \in X} s_{(0,0)}(i,j)\, r_{(0,0)}((x-i)(\mathrm{mod}\ m),\, (y-j)(\mathrm{mod}\ n)) .$$

The last expression for $t_{(0,0)}(x,y)$ represents the gray values of $T(0,0)$ as the cyclic
convolution of the two-dimensional sequences $\{ s_{(0,0)}(i,j) \}_{(i,j) \in X}$ and $\{ r_{(0,0)}(i,j) \}_{(i,j) \in X}$. By the
convolution theorem, $T(0,0) \oplus K = (R(0,0) \oplus K) * (S(0,0) \oplus K)$, which implies that
$\Phi(R \oplus S) = \Phi(R) * \Phi(S)$.

Q.E.D. 

One  of  the  useful  properties  of  the  discrete  Fourier  transform  is  that  it  converts 
convolutions  into  pointwise  multiplications,  a  fact  which  is  expressed  in  Theorem  3.21. 
Because  of  this,  local  algorithms  for  computing  convolutions  can  be  derived  by  deriving 
local  algorithms  for  computing  discrete  Fourier  transforms.  A  similar  situation  exists  for 
other  invertible  linear  transformations. 

Definition 3.22. We say that $T \in L_X$ is a diagonal template if, relative to some ordering
on $X$, $M_T$ is a diagonal matrix. By Theorem 1.12, if $M_T$ is diagonal relative to some
ordering on $X$ then for every ordering on $X$, $\Psi(T)$ is a diagonal matrix.

With each diagonal template $T$ we associate the image $A_T$ defined by
$A_T = \sum_{x \in X} T(x)$. Note that if $A_T = \{ (x, a(x)) \}$, then $a(x) = t_x(x)$. Moreover,
$A \oplus T = A * A_T$.

Theorem 3.23. Assume that $T \in L_X$ is invertible. Define

$$L_X(T) = \{\, T^{-1} \oplus D \oplus T : D \text{ is a diagonal template} \,\} .$$

$L_X(T)$ is a commutative ring which is isomorphic to $\mathbf{C}^X$.

Proof.

The set $L_X(T)$ is a commutative ring since $\Psi(L_X(T)) =
\{\, M_T \Lambda M_T^{-1} : \Lambda \text{ is a diagonal matrix} \,\}$ is a commutative ring. Define $\Phi : L_X(T) \to \mathbf{C}^X$ by
$\Phi(S) = A_D$ where $D = T \oplus S \oplus T^{-1}$ for $S \in L_X(T)$. The verification that $\Phi$ is an
isomorphism is routine and we omit it.

Q.E.D. 

Thus,  if  methods  can  be  found  for  implementing  invertible  linear  transforms  locally, 
then  those  methods  can  be  used  to  implement  a  larger  class  of  linear  transforms  locally. 

We  have  defined  translation  invariant  and  circulant  templates  and  outlined  their 
relationships  to  matrix  algebra  associated  with  the  discrete  Fourier  transform.  We  now 
describe  some  relationships  between  circulant  templates  and  polynomials. 

3.2.  Circulant  Templates  and  Polynomials 

In this section, we describe a family of relationships between circulant templates
and quotient rings of polynomial rings. We show that the problem of finding
decompositions of circulant templates is equivalent to factoring multivariable polynomials. We
show that for separable circulant templates the problem reduces to the single variable
case and is therefore equivalent to finding roots of polynomials in one variable. We show
how, by the use of shifts and the fundamental theorem of algebra, minimal local
decompositions of separable circulant templates can be obtained. We develop the results in the
two-dimensional setting. As was true in the previous section, they extend easily to the
higher dimensional cases.

Throughout this section, $\mathbf{x}$, $\mathbf{y}$, $\mathbf{z}$ will denote elements of the $m \times n$ array $X$,
$\mathbf{0} = (0,0)$, $x$ and $y$ will denote indeterminates, and $\mathbf{C}[x,y]$ the ring of polynomials in two
variables with coefficients in $\mathbf{C}$. We let $\mathbf{C}[x,y]/(x^m-1,\, y^n-1)$ denote the quotient ring of
polynomials mod $x^m-1$ and $y^n-1$. We consider the elements of this ring to be polynomials
on which multiplication is performed by replacing $x^m$ and $y^n$ with 1 wherever they
appear rather than as cosets. We also denote by $\mathcal{V}$ the template configuration defined
by $\mathcal{V}(i,j) = \{ (i,j),\ (i+1 \ (\mathrm{mod}\ m),\, j),\ (i-1 \ (\mathrm{mod}\ m),\, j),\ (i,\, j+1 \ (\mathrm{mod}\ n)),\ (i,\, j-1 \ (\mathrm{mod}\ n)) \}$. $\mathcal{V}$ is
called the von Neumann configuration. In this section, local shall mean local with
respect to this particular configuration.

For every $\mathbf{z} \in X$ we define a mapping $\Gamma_{\mathbf{z}} : \mathcal{C}_X \to \mathbf{C}[x,y]/(x^m-1,\, y^n-1)$ by

$$\Gamma_{\mathbf{z}}(T) = p_t(x,y,\mathbf{z}) = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} t_{\mathbf{z}}(i,j)\, x^i y^j .$$

If $T$ is a circulant template then $\Gamma_{\mathbf{z}}(T)$ is called a polynomial representative of $T$.

Example. Let $R$ be the circulant template depicted in figure 3.1. Then

$$\Gamma_{(0,0)}(R) = 4 + 6y + 2y^{n-1} + 6x + 9xy + 3xy^{n-1} + 2x^{m-1} + 3x^{m-1}y + x^{m-1}y^{n-1}$$

and

$$\Gamma_{(1,1)}(R) = 1 + 2y + 3y^2 + 2x + 4xy + 6xy^2 + 3x^2 + 6x^2y + 9x^2y^2 .$$

$$R = \begin{pmatrix} 1 & 2 & 3 \\ 2 & \boxed{4} & 6 \\ 3 & 6 & 9 \end{pmatrix}$$

Figure 3.1. A circulant template R. (The boxed entry marks the target point.)

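Properties.

1.) If $T$ is a circulant template and $\Gamma_{(0,0)}(T) = x - a$ or $y - a$ for some $a$, then $T$ is
local.

2.) If $T$ is a circulant template and $\Gamma_{(1,0)}(T) = a_0 + a_1x + a_2x^2$ or $\Gamma_{(0,1)}(T) =
a_0 + a_1y + a_2y^2$, then $T$ is local.

3.) If $T$ is a circulant template and $\Gamma_{(0,0)}(T) = x^i y^j$, then the circulant transform
$A \to A \oplus T$ simply circularly shifts all the gray levels $i$ units vertically and $j$ units
horizontally.

Theorem 3.24. If $S$ and $T$ are circulant templates, then $\Gamma_{\mathbf{0}}(S \oplus T) = \Gamma_{\mathbf{0}}(S)\Gamma_{\mathbf{0}}(T)$.


Proof.

For any $\mathbf{z} \in X$ denote by $\phi_{\mathbf{z}}$ the circulant translation defined by $\phi_{\mathbf{z}}(\mathbf{x}) = (\mathbf{x} - \mathbf{z})\ \mathrm{mod}\ (m,n)$.
Note that $\phi_{\mathbf{z}}(\mathbf{z}) = \mathbf{0}$. Recall that, if $R = S \oplus T$, then

$$R(\mathbf{0}) = \Big\{ (\mathbf{y}, r_{\mathbf{0}}(\mathbf{y})) : r_{\mathbf{0}}(\mathbf{y}) = \sum_{\mathbf{z} \in \mathcal{T}(\mathbf{0})} t_{\mathbf{0}}(\mathbf{z})\, s_{\mathbf{z}}(\mathbf{y}) \Big\} .$$

Since $S$ is a circulant, $s_{\mathbf{z}}(\mathbf{y}) = s_{\mathbf{0}}(\phi_{\mathbf{z}}(\mathbf{y})) = s_{\mathbf{0}}((\mathbf{y} - \mathbf{z})\ \mathrm{mod}\ (m,n))$. But by definition of
polynomial multiplication (in $\mathbf{C}[x,y]/(x^m-1,\, y^n-1)$) the numbers
$\sum_{\mathbf{z} \in \mathcal{T}(\mathbf{0})} t_{\mathbf{0}}(\mathbf{z})\, s_{\mathbf{0}}((\mathbf{y} - \mathbf{z})\ \mathrm{mod}\ (m,n))$ are precisely the
coefficients of the polynomial product $p_s(x,y,\mathbf{0})\, p_t(x,y,\mathbf{0})$.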

Q.E.D. 

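Theorem 3.25. For every $\mathbf{z} = (j,k) \in X$, $\Gamma_{\mathbf{z}}$ is 1-1 and onto. Moreover, if $S, T \in \mathcal{C}_X$ then

1.) $\Gamma_{\mathbf{z}}(S) + \Gamma_{\mathbf{z}}(T) = \Gamma_{\mathbf{z}}(S+T)$.

2.) $\Gamma_{\mathbf{z}}(T) = x^j y^k\, \Gamma_{\mathbf{0}}(T)$.

3.) $\Gamma_{\mathbf{z}}(S \oplus T)\, x^j y^k = \Gamma_{\mathbf{z}}(S)\, \Gamma_{\mathbf{z}}(T)$.

Proof:

Let $\mathbf{z} = (j,k) \in X$. The mapping $\Gamma_{\mathbf{z}}$ is clearly onto and is 1-1 since $\mathcal{C}_X$ contains
only circulant templates with minimal configuration. Equation 1.) follows from the fact
that addition of templates is defined pointwise.

To prove 2.) let $\phi$ be the circulant translation defined by $\phi(\mathbf{x}) = (\mathbf{x} + \mathbf{z})\ \mathrm{mod}\ (m,n)$.
Note that $\phi(\mathbf{0}) = \mathbf{z}$. Let $T$ be an arbitrary circulant template. Then

$$x^j y^k\, p_t(x,y,\mathbf{0}) = x^j y^k \sum_{i=0}^{m-1} \sum_{h=0}^{n-1} t_{\mathbf{0}}(i,h)\, x^i y^h$$

$$= \sum_{i=0}^{m-1} \sum_{h=0}^{n-1} t_{\mathbf{z}}((i+j)\ \mathrm{mod}\ m,\, (h+k)\ \mathrm{mod}\ n)\, x^{i+j} y^{h+k}$$

$$= \sum_{i=0}^{m-1} \sum_{h=0}^{n-1} t_{\mathbf{z}}(i,h)\, x^i y^h = p_t(x,y,\mathbf{z})$$

since $x^m = y^n = 1$ and $T$ is circulant. This enables us to deduce that $\Gamma_{\mathbf{z}}(T) = x^j y^k\, \Gamma_{\mathbf{0}}(T)$
or $x^{m-j} y^{n-k}\, \Gamma_{\mathbf{z}}(T) = \Gamma_{\mathbf{0}}(T)$.

To prove 3.), observe that, since $\Gamma_{\mathbf{0}}(S \oplus T) = \Gamma_{\mathbf{0}}(S)\Gamma_{\mathbf{0}}(T)$, we have that

$$\Gamma_{\mathbf{z}}(S \oplus T) = x^j y^k\, \Gamma_{\mathbf{0}}(S)\, \Gamma_{\mathbf{0}}(T) = x^j y^k \left[ x^{m-j} y^{n-k}\, \Gamma_{\mathbf{z}}(S)\; x^{m-j} y^{n-k}\, \Gamma_{\mathbf{z}}(T) \right] = x^{m-j} y^{n-k}\, \Gamma_{\mathbf{z}}(S)\, \Gamma_{\mathbf{z}}(T) .$$

The desired result now follows by multiplying the last equation by $x^j y^k$.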


Q.E.D. 


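Example. Let R be the circulant template depicted in Figure 3.1. Note that $\Gamma_{(1,1)}(R) = xy\,\Gamma_{(0,0)}(R)$. Let S and T be the circulant templates depicted in Figure 3.2. Then $S \oplus T = R$, $\Gamma_{(1,1)}(S) = y + 2xy + 3x^2y$, and $\Gamma_{(1,1)}(T) = x + 2xy + 3xy^2$. A simple polynomial multiplication shows that $xy\,\Gamma_{(1,1)}(S \oplus T) = xy\,\Gamma_{(1,1)}(R) = \Gamma_{(1,1)}(S)\,\Gamma_{(1,1)}(T)$.

[Figure: S is the vertical three-point template with weights (1, 2, 3) and T is the horizontal three-point template with weights (1, 2, 3); in each, the center weight is 2.]

Figure 3.2. Circulant templates S and T.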
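
The example can be replayed mechanically. The sketch below is illustrative only: the 4 x 5 array size and all function names are assumptions, and a template is stored as the coefficient array of $\Gamma_0$, using the circulant relation $t_z(w) = t_0(w - z)$ to obtain $\Gamma_z$.

```python
M, N = 4, 5  # array size; an arbitrary choice for this sketch


def template(weights):
    """Coefficient array of Gamma_0: weights maps an offset (di, dj) to t_0(di, dj)."""
    coeffs = [[0] * N for _ in range(M)]
    for (di, dj), w in weights.items():
        coeffs[di % M][dj % N] = w
    return coeffs


def poly_mul(p, q):
    """Multiply in C[x,y]/(x^M - 1, y^N - 1); entry [i][j] is the coefficient of x^i y^j."""
    out = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for a in range(M):
                for b in range(N):
                    out[(i + a) % M][(j + b) % N] += p[i][j] * q[a][b]
    return out


def gamma(t0, z):
    """Coefficients of Gamma_z(T), using t_z(i, j) = t_0((i, j) - z) for circulant T."""
    j, k = z
    return [[t0[(a - j) % M][(b - k) % N] for b in range(N)] for a in range(M)]


S0 = template({(-1, 0): 1, (0, 0): 2, (1, 0): 3})  # vertical template S
T0 = template({(0, -1): 1, (0, 0): 2, (0, 1): 3})  # horizontal template T
R0 = poly_mul(S0, T0)                              # Gamma_0(S (+) T) by Theorem 3.24

z = (1, 1)
xy = template({(1, 1): 1})                         # the monomial x y
assert gamma(R0, z) == poly_mul(xy, R0)                                    # Theorem 3.25, 2.)
assert poly_mul(gamma(R0, z), xy) == poly_mul(gamma(S0, z), gamma(T0, z))  # Theorem 3.25, 3.)
```

The array R0 reproduces the nine weights of R, with the weight 4 at the origin.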

Corollary 3.26. $\Gamma_0$ is an isomorphism.

Corollary 3.27. Let $z_1 = (i,j)$, $z_2 = (s,t) \in X$ and $T \in C_X$. If $p_t(x,y,z_1) = p_1(x,y)\,p_2(x,y)$, then $p_t(x,y,z_2) = q_1(x,y)\,q_2(x,y)$ where $q_1(x,y) = x^{s-i} y^{t-j}\,p_1(x,y)$ and $q_2(x,y) = p_2(x,y)$.

Corollary 3.28. If $z = (i,j) \in X$, $T \in C_X$, and $p_t(x,y,z) = p_1(x,y)\,p_2(x,y)$, then $T = T_1 \oplus T_2 \oplus S$, where $T_1 = \Gamma_z^{-1}(p_1(x,y))$, $T_2 = \Gamma_z^{-1}(p_2(x,y))$, and S represents a circular shift of n-j units horizontally and m-i units vertically. Moreover, since S represents a circular shift, it can be replaced by a template $\tilde{S}$ which represents a circular shift j units horizontally and i vertically.

The $\Gamma_z$'s constitute a class of mappings between image and polynomial algebra which are almost isomorphisms. Since they differ from isomorphisms only by shifts, they can be used in the same way. Clearly if any of the polynomial representatives of a template can be factored, then the template can be factored correspondingly. Thus, the template decomposition problem is equivalent to the problem of factoring multivariable polynomials. It is interesting to observe that there is research being done in the area of using computer programs to determine exact factorizations of multivariable polynomials over the rationals [25, 34]. We now examine some specific consequences of these theorems. We will use them to show how any separable circulant template can be implemented locally with respect to the von Neumann configuration and give upper bounds on the number of parallel steps required. Since the von Neumann restriction on an m x n array simulates mesh-connected arrays of processors, these methods can be used on such machines.

Definition 3.29. If $f(x,y) = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} a_{ij}\,x^i y^j$, then

\[ \deg(f(x,y)) = \max\{\, i+j : a_{ij} \neq 0 \,\}, \]
\[ \deg_x(f(x,y)) = \max\{\, i : a_{ij} \neq 0 \,\}, \]
\[ \deg_y(f(x,y)) = \max\{\, j : a_{ij} \neq 0 \,\}. \]

Definition 3.30. If T is a circulant template and $\deg(p_t(x,y,z)) \le \deg(p_t(x,y,w))$ for every $w \in X$, then we say that z is a minimal point for T.

Definition 3.31. Let T be a circulant template. We say that T is separable if there exist polynomials f(x) and g(y) such that $p_t(x,y,0) = f(x)\,g(y)$. By the preceding corollaries, if T is separable, then for every $z \in X$ there exist polynomials $f_z(x)$ and $g_z(y)$ such that $p_t(x,y,z) = f_z(x)\,g_z(y)$.

The following theorem yields a systematic method for computing any separable circulant transform locally with respect to a four-connected processor array and gives an upper bound on the number of parallel steps required.

Theorem 3.32. Let T be a separable, circulant template and let z = (s,t) be a minimal point for T. Let $k = \deg_x(p_t(x,y,z))$ and $j = \deg_y(p_t(x,y,z))$. Let v = min(s, m-s) and h = min(t, n-t). The circulant transform $A \to A \oplus T$ can be computed in at most v+h+k+j+1 local, parallel steps. Specifically, v+h of these steps consist of vertical or horizontal circular shifts of the entire array by one location, k+j of the steps consist of at most one addition and one multiplication (possibly complex) per pixel, and one of the steps consists of at most one multiplication per pixel.

Proof.

Since T is separable, there exist f(x) and g(y) such that $p_t(x,y,z) = f(x)\,g(y)$. By assumption, deg(f(x)) = k and deg(g(y)) = j. By the fundamental theorem of algebra, $f(x) = a(x-q_1)(x-q_2)\cdots(x-q_k)$ and $g(y) = b(y-r_1)(y-r_2)\cdots(y-r_j)$ where $a, b, q_1, \ldots, q_k, r_1, \ldots, r_j$ are (possibly) complex numbers. Choose circulant templates $Q, Q_1, \ldots, Q_k, R, R_1, \ldots, R_j$ such that $\Gamma_0(Q) = f(x)$, $\Gamma_0(R) = g(y)$, $\Gamma_0(Q_i) = x - q_i$, and $\Gamma_0(R_i) = y - r_i$. By Property 1.), the $Q_i$ and $R_i$ are local. By construction, $\Gamma_z(T) = \Gamma_0(Q)\,\Gamma_0(R)$. Furthermore, by Theorem 3.24, $\Gamma_0(Q) = a \prod_{i=1}^{k} \Gamma_0(Q_i)$, which implies that $Q = a\,(\oplus_{i=1}^{k} Q_i)$. Similarly, $R = b\,(\oplus_{i=1}^{j} R_i)$. Hence, by Corollary 3.28,

\[ T = ab \Big[ \big(\oplus_{i=1}^{k} Q_i\big) \oplus \big(\oplus_{i=1}^{j} R_i\big) \oplus S \Big]. \]

This last equation expresses T as a product of local templates. The number of steps required to implement the circulant transform $A \to A \oplus T$ is k+j+1 multiplication steps plus the number of shifts required, which is v+h.

Q.E.D.
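
The construction in the proof can be traced on a small example. The sketch below makes several assumptions that are not in the text: the roots are taken as known integers (in practice they would come from a root finder), the transform convention is $B(y) = \sum_w t_0(w)\,a(y+w)$, the array is 6 x 7, and all function names are hypothetical.

```python
M, N = 6, 7  # array size; an arbitrary choice for this sketch


def poly_from_roots(lead, roots):
    """Coefficients, lowest degree first, of lead * prod_i (x - roots[i])."""
    coeffs = [lead]
    for root in roots:
        padded = [0] + coeffs + [0]
        coeffs = [padded[i] - root * padded[i + 1] for i in range(len(coeffs) + 1)]
    return coeffs


def direct_transform(img, f, g, z):
    """Apply T with Gamma_z(T) = f(x) g(y) directly: B(y) = sum_w t_0(w) img(y + w)."""
    s, t = z
    return [[sum(f[a] * g[b] * img[(i + a - s) % M][(j + b - t) % N]
                 for a in range(len(f)) for b in range(len(g)))
             for j in range(N)] for i in range(M)]


def linear_step(img, root, axis):
    """Local step for the template with polynomial x - root (axis 0) or y - root (axis 1)."""
    di, dj = (1, 0) if axis == 0 else (0, 1)
    return [[img[(i + di) % M][(j + dj) % N] - root * img[i][j]
             for j in range(N)] for i in range(M)]


def unit_shift(img, di, dj):
    """Circular shift by one location: B(i, j) = img(i + di, j + dj)."""
    return [[img[(i + di) % M][(j + dj) % N] for j in range(N)] for i in range(M)]


def local_transform(img, a, q_roots, b, r_roots, z):
    steps = 0
    for q in q_roots:                       # k steps, one per factor x - q_i
        img, steps = linear_step(img, q, 0), steps + 1
    for r in r_roots:                       # j steps, one per factor y - r_i
        img, steps = linear_step(img, r, 1), steps + 1
    img, steps = [[a * b * v for v in row] for row in img], steps + 1
    s, t = z
    # Account for the offset z with unit shifts, going the shorter way around.
    vertical = (s, (-1, 0)) if s <= M - s else (M - s, (1, 0))
    horizontal = (t, (0, -1)) if t <= N - t else (N - t, (0, 1))
    for count, (di, dj) in (vertical, horizontal):
        for _ in range(count):
            img, steps = unit_shift(img, di, dj), steps + 1
    return img, steps


image = [[(3 * i + 5 * j) % 11 for j in range(N)] for i in range(M)]
a, q_roots, b, r_roots, z = 2, [1, -2], 1, [3], (1, 0)
f, g = poly_from_roots(a, q_roots), poly_from_roots(b, r_roots)
result, steps = local_transform(image, a, q_roots, b, r_roots, z)
assert result == direct_transform(image, f, g, z)
```

Here k = 2, j = 1, v = 1 and h = 0, so the step counter agrees with the count v+h+k+j+1 in the theorem. The sketch does not check that z is a minimal point; it only checks that the local steps reproduce the direct computation.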

The  templates  are  shown  in  Figure  3.3.  The  slashes  indicate  the  location  of  the 
center  pixel. 


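[Figure: $Q_i$ is the two-point vertical template with weight $-q_i$ at the center pixel and weight 1 at the adjacent pixel; $R_i$ is the two-point horizontal template with weight $-r_i$ at the center pixel and weight 1 at the adjacent pixel.]

Figure 3.3. Templates used in Theorem 3.32.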


As an example suppose X is a 512 x 512 array and T is a 30 x 30 separable template. Then z = (14,14) and v = h = 14. Hence, if $A \in K^X$, then $A \to A \oplus T$ can be computed locally in parallel with 60 parallel steps consisting of at most one multiplication and addition per pixel, 1 step consisting of one multiplication per point, and a total of 28 unit horizontal or vertical circular shifts. Since the computation in its original form required 900 multiplications per point, it is clear that template decompositions can be used to derive algorithms which are more efficient with respect to the number of arithmetic operations as well as parallel. Nontrivial examples of such templates are the discretizations of the Marr-Hildreth edge operators [66].

The next theorem also yields a systematic method for computing separable circulant transforms locally and avoids complex multiplications.

Theorem 3.33. Assume that T is a separable circulant template with real weights and that, for every $z \in X$, $p_t(x,y,z) = \Gamma_z(T) = f_z(x)$ is a polynomial in x only. Let (p,0) be a minimal point for T.

1.) Assume that $\deg(f_{(p,0)}(x)) = 2k$ for some $k \ge 1$ and let v = min(m-|k-p|, |k-p|). The circulant transform $A \to A \oplus T$ can be computed in k+v+1 local, parallel steps. Specifically, v of these steps consist of vertical, circular shifts of the entire array by one location, k of these steps consist of at most 2 real multiplications and additions per pixel, and 1 step consists of at most 1 real multiplication per pixel.

2.) Assume that $\deg(f_{(p,0)}(x)) = 2k+1$ for some $k \ge 1$. The conclusion of 1.) holds with k replaced by k+1.

Proof.

1.) We can write $\Gamma_{(p,0)}(T) = c \prod_{i=1}^{k} q_i(x)$ where the $q_i(x)$ are monic, quadratic polynomials with real coefficients and c is a real number. For each $i \in [1,k]$ define $Q_i \in L_X$ by $Q_i = \Gamma_{(1,0)}^{-1}(q_i(x))$. By Property 2.), the $Q_i$ are local templates. Furthermore,

\[ \Gamma_0(T) = x^{-p}\,\Gamma_{(p,0)}(T) = c\,x^{-p} \prod_{i=1}^{k} \Gamma_{(1,0)}(Q_i) = c\,x^{k-p} \prod_{i=1}^{k} \Gamma_0(Q_i). \]

Hence, $T = c\,(\oplus_{i=1}^{k} Q_i) \oplus S$ where S is a shift template.

If $k \ge p$, then we can take S to be a shift of $v_1 = \min(k-p,\ m-(k-p))$ units. If $k < p$, then we can take S to be a shift of $v_2 = \min(p-k,\ m-(p-k))$ units. In any case, the shift can be executed in at most v steps.

2.) The proof is almost identical. We write

\[ \Gamma_{(p,0)}(T) = c \Big[ \prod_{i=1}^{k} q_i(x) \Big] p(x) \]

where the $q_i(x)$ are monic, quadratic polynomials with real coefficients, p(x) is a monic linear polynomial with real coefficients, and c is a real number. We take P to be the local template $P = \Gamma_0^{-1}(p(x))$. Taking the $Q_i$ to be as in 1.) yields

\[ \Gamma_0(T) = c\,x^{-p} \Big[ \prod_{i=1}^{k} \Gamma_{(1,0)}(Q_i) \Big] \Gamma_0(P) = c\,x^{k-p} \Big[ \prod_{i=1}^{k} \Gamma_0(Q_i) \Big] \Gamma_0(P). \]

The rest of the proof is identical to the latter part of the proof of part 1.).

Q.E.D.
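
The saving in Theorem 3.33 comes from pairing each complex root with its conjugate, so that two complex linear steps collapse into one real quadratic step. The sketch below illustrates this on a single circular column; the test values and function names are hypothetical, and the step convention $B(i) = \sum_w t_0(w)\,a(i+w)$ is an assumption.

```python
def linear_step(col, root):
    """Local step for x - root on a circular column: B(i) = col(i+1) - root * col(i)."""
    m = len(col)
    return [col[(i + 1) % m] - root * col[i] for i in range(m)]


def quadratic_step(col, a0, a1):
    """Local step for the template with Gamma_(1,0) = a0 + a1 x + x^2.

    Uses two real multiplications and two real additions per pixel.
    """
    m = len(col)
    return [a0 * col[(i - 1) % m] + a1 * col[i] + col[(i + 1) % m] for i in range(m)]


def real_quadratic(q):
    """Return (a0, a1) with (x - q)(x - conj(q)) = x^2 + a1 x + a0, both real."""
    return abs(q) ** 2, -2.0 * q.real


q = complex(0.5, 1.5)
a0, a1 = real_quadratic(q)
column = [1.0, 4.0, -2.0, 0.5, 3.0]
m = len(column)

two_complex_steps = linear_step(linear_step(column, q), q.conjugate())
one_real_step = quadratic_step(column, a0, a1)

# The quadratic template is centered on its middle weight, so the two results
# differ by a one-unit circular shift (the factor x in Gamma_(1,0) = x Gamma_0).
assert all(abs(two_complex_steps[(i - 1) % m] - one_real_step[i]) < 1e-9
           for i in range(m))
```

The one-unit shift per quadratic factor is what accumulates into the factor $x^{k-p}$ in the proof.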


Clearly an analogous theorem holds for circulant templates T with the property that for every $z \in X$, $\Gamma_z(T) = g_z(y)$ is a polynomial in y only.

Let $\lceil x \rceil$ denote the smallest integer greater than or equal to x.

Corollary 3.34. Assume that T is a separable circulant template and let (p,q) be a minimal point for T. Let

\[ k = \Big\lceil \frac{\deg_x(p_t(x,y,(p,q)))}{2} \Big\rceil \quad\text{and}\quad j = \Big\lceil \frac{\deg_y(p_t(x,y,(p,q)))}{2} \Big\rceil. \]

Let v = min(m-|k-p|, |k-p|) and h = min(n-|j-q|, |j-q|). The circulant transform $A \to A \oplus T$ can be computed in k+j+v+h+1 local, parallel steps. Specifically, v+h of these steps consist of vertical or horizontal shifts of the entire array by one location, k+j of these steps consist of at most 2 real multiplications and additions per point, and one of the steps consists of at most one real multiplication per point.

The templates corresponding to the quadratic polynomials used in this technique are depicted in Figure 3.4.

[Figure: $Q_i$ is the three-point vertical template with weights $a_0$, $a_1$, 1 and $R_i$ is the three-point horizontal template with weights $a_0$, $a_1$, 1; in each, the center weight is $a_1$.]

Figure 3.4. Quadratic templates used in Corollary 3.34.

Remark 3.35. The latter technique will generally be preferable to the former since one complex multiplication takes at least three real multiplications and five real additions or four real multiplications and two real additions [6].
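
The two operation counts in Remark 3.35 correspond to two ways of arranging a complex product. The sketch below shows one well-known arrangement of each kind; the function names are hypothetical, and the text does not say which arrangement reference [6] uses.

```python
def complex_mul_4m2a(a, b, c, d):
    """(a + bi)(c + di) with four real multiplications and two real additions."""
    return a * c - b * d, a * d + b * c


def complex_mul_3m5a(a, b, c, d):
    """(a + bi)(c + di) with three real multiplications and five real additions."""
    k1 = c * (a + b)   # addition 1, multiplication 1
    k2 = a * (d - c)   # addition 2, multiplication 2
    k3 = b * (c + d)   # addition 3, multiplication 3
    return k1 - k3, k1 + k2   # additions 4 and 5


assert complex_mul_3m5a(1, 2, 3, 4) == complex_mul_4m2a(1, 2, 3, 4) == (-5, 10)
```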

We have described a family of relationships between circulant templates and polynomial algebra. We used these relationships to show how local decompositions of separable circulant templates can be obtained by factoring or finding roots of polynomials in a single variable. Finding roots of polynomials can be a numerically unstable procedure [38]. In recent years, research has been done on developing computer programs to factor polynomials exactly [25, 34]. This work could be applicable to the development of parallel algorithms for computing convolutions.

3.3.  G-templates 


In this section we generalize the notion of circulant templates by defining G-templates. A G-template will be defined as a template which is translation invariant with respect to a digraph induced by a template configuration which admits of a group structure. In the case that the configuration is the von Neumann configuration, the group will turn out to be $Z_m \times Z_n$ and the set of all G-templates will be the algebra of all circulant templates. Similarly, we will show that the set of all G-templates for other configurations is isomorphic to the group algebra over C of the group corresponding to the template configuration. We will also show that the invertibility of a G-template is directly related to the invertibility of the discrete Radon transform. We conclude this section with a discussion of the possible applications of G-templates to parallel processing.

Throughout this section, let $X \subset \mathbb{Z}^2$ be a finite set, $\mathcal{R}$ a template configuration on X, G a finite group (written multiplicatively), and assume that images have values in C.

Definition 3.36. Let $A = \{ g_1, g_2, \ldots, g_k \}$ be a set of generators of G. The Cayley color graph, or group graph, of G with respect to A is the digraph $D_A(G) = (V,E)$ where V = G and $(x,y) \in E$ if and only if there exists $g_i \in A$ such that $x g_i = y$. If $(x,y) \in E$ and $x g_i = y$, then we say that the arc (x,y) is colored $g_i$, or has the color $g_i$.

Definition 3.37. An automorphism of a digraph D is a permutation $\sigma$ on V(D) with the property that $(u,v) \in E(D)$ if and only if $(\sigma(u),\sigma(v)) \in E(D)$. A color-preserving automorphism of a Cayley color graph, $D_A(G)$, is an automorphism, $\sigma$, of $D_A(G)$ with the property that for every $x,y \in V(D_A(G))$ the arcs (x,y) and $(\sigma(x),\sigma(y))$ have the same color.

Fact 3.38. The set of color preserving automorphisms of $D_A(G)$ is a group under composition which is isomorphic to G. An isomorphism is the mapping $g \to \sigma_g$ where $\sigma_g$ is the automorphism of $D_A(G)$ defined by $\sigma_g(h) = gh$. Note that this implies that the group of color preserving automorphisms is the same for every group graph constructed using generating sets of G [4].

Definition 3.39. Let $D_1$ and $D_2$ be digraphs. We say that $D_1$ is isomorphic to $D_2$ if and only if there exists a 1-1, onto mapping, $\alpha : V(D_1) \to V(D_2)$, with the property that $(x,y) \in E(D_1)$ if and only if $(\alpha(x),\alpha(y)) \in E(D_2)$.

Definition 3.40. We say that the pair $(X,\mathcal{R})$ simulates the group G if and only if $D(\mathcal{R})$ is isomorphic to $D_A(G)$ as digraphs for some generating set A. We assign to each arc in $D(\mathcal{R})$ the same color as the associated arc in $D_A(G)$.

Theorem 3.41. If $(X,\mathcal{R})$ simulates G then a binary operation, $\circ$, can be defined on X which makes it a group isomorphic to G.

Proof.

Since $(X,\mathcal{R})$ simulates G there exists a generating set, A, of G, and a 1-1, onto mapping $\alpha : V(D(\mathcal{R})) \to V(D_A(G))$. By definition of the respective digraphs, $V(D(\mathcal{R})) = X$ and $V(D_A(G)) = G$, which implies that $\alpha : X \to G$. Define $\circ$ by $x \circ y = \alpha^{-1}(\alpha(x)\alpha(y))$. Checking to see that X is a group under $\circ$ is a routine matter and we omit it.

Q.E.D.

Remark 3.42. Henceforth, if $(X,\mathcal{R})$ simulates G, then we identify X with G whenever it is convenient. If we order $X = \{ x_0, x_1, \ldots, x_{n-1} \}$, then we assume that $x_0 = e$ is the identity element of G. If $\phi : G \to G$, then we can define a map $\phi^* : X \to X$ by $\phi^*(x) = \alpha^{-1}(\phi(\alpha(x)))$. In such a case we write $\phi$ for either map.

Definition 3.43. Assume that $(X,\mathcal{R})$ simulates G. We say that $T \in L_X$ is a G-template if and only if for every color-preserving automorphism $\phi : X \to X$, the equation $t_{\phi(x)}(\phi(y)) = t_x(y)$ holds for all $x, y \in X$. We denote by $G_X$ the set of all G-templates.

Remark 3.44. The set of G-templates is, in a sense, the set of all templates which are translation invariant with respect to the group graph $D_A(G)$ or $D(\mathcal{R})$ since the color-preserving automorphisms are essentially translations within the group.
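
The statement that color-preserving automorphisms are translations can be observed on a small Cayley color graph. The sketch below is an illustration under assumptions: it uses the additive group $Z_3 \times Z_4$ with four generators, and every name in it is hypothetical.

```python
from itertools import product

m, n = 3, 4  # a small Z_m x Z_n; the sizes are arbitrary
G = list(product(range(m), range(n)))
generators = [(1, 0), (m - 1, 0), (0, 1), (0, n - 1)]


def add(x, y):
    return ((x[0] + y[0]) % m, (x[1] + y[1]) % n)


# Cayley color graph: the arc (x, x + g) carries the color g.
arcs = {(x, add(x, g)): g for x in G for g in generators}


def is_automorphism(sigma):
    """For a permutation sigma: arcs map onto arcs (Definition 3.37)."""
    return {(sigma(x), sigma(y)) for (x, y) in arcs} == set(arcs)


def preserves_colors(sigma):
    """Every arc is mapped to an arc of the same color."""
    return all(arcs.get((sigma(x), sigma(y))) == color
               for (x, y), color in arcs.items())


def translation(g):
    return lambda h: add(g, h)


def negation(h):
    return ((-h[0]) % m, (-h[1]) % n)


# Every translation h -> g + h preserves colors (Fact 3.38) ...
assert all(preserves_colors(translation(g)) for g in G)
# ... while h -> -h is an automorphism of the digraph that swaps colors.
assert is_automorphism(negation) and not preserves_colors(negation)
```

The negation map shows why the color condition matters: it respects adjacency but is not a translation, so invariance under it is not required of a G-template.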

We  now  show  that  G-templates  are  generalizations  of  circulant  templates. 

Theorem 3.45. Assume that X is an m x n array and that $\mathcal{R} = \mathcal{V}$, the von Neumann configuration. The pair $(X,\mathcal{V})$ simulates $G = Z_m \times Z_n$ (written additively). Furthermore, $G_X = C_X$, the set of all circulant templates on X.

Proof.

Let A = { (1,0), (m-1,0), (0,1), (0,n-1), (0,0) }. Note that X is isomorphic to $Z_m \times Z_n$ where the operation on X is addition mod (m,n). Hence we can take $\alpha : X \to G$ to be the identity mapping and we only need to show that if $(x,y) \in E(D(\mathcal{V}))$ then $(x,y) \in E(D_A(G))$. Assume that $(x,y) \in E(D(\mathcal{V}))$. Then $x \in \mathcal{V}(y)$ so $(y-x) \bmod (m,n) \in A$, which implies that $y = (x+g) \bmod (m,n)$ for some $g \in A$. This implies that $(x,y) \in E(D_A(G))$. Thus $(X,\mathcal{V})$ simulates G.

To see that $G_X = C_X$, recall that the set of circulant translations on X is a group under composition which is isomorphic to G and therefore to the group of color-preserving automorphisms of $D_A(G)$ for any A. Thus, a mapping $\phi : X \to X$ is a color-preserving automorphism if and only if it is a circulant translation. By definition, $T \in C_X$ if and only if $t_{\phi(x)}(\phi(y)) = t_x(y)$ holds for every $x,y \in X$. Thus $C_X = G_X$.

Q.E.D.

In a similar fashion, the set of all G-templates can be shown to be isomorphic to the group algebra of G over C.

Lemma 3.46. Assume that $(X,\mathcal{R})$ simulates G. Let $A \in \mathbb{C}^X$ and define $T \in L_X$ by $T(e) = A$ and $T(x) = \phi_x^{-1}(A) \equiv \{\, (z,\, a(\phi_x^{-1}(z))) \,\}$ where $\phi_x$ is the color preserving automorphism corresponding to x. The template T is a G-template.

Proof.

Let $x, y \in X$ and let $\phi_z$ be a color preserving automorphism of $D(\mathcal{R})$. Note that

\[ \phi_{\phi_z(x)} = \phi_{zx} = \phi_z \phi_x \]

since the mapping $u \to \phi_u$ is a group isomorphism. Therefore, by definition,

\[ \phi_{\phi_z(x)}^{-1}(\phi_z(y)) = \phi_{zx}^{-1}(zy) = (zx)^{-1} z y = x^{-1} y = \phi_x^{-1}(y). \]

Hence

\[ t_{\phi_z(x)}(\phi_z(y)) = a\big(\phi_{\phi_z(x)}^{-1}(\phi_z(y))\big) = a(\phi_x^{-1}(y)) = t_x(y) \]

which shows that T is a G-template and therefore proves the lemma.

Q.E.D.

Theorem 3.47. The set $G_X$ is an algebra over C.

Proof.

Let $S,T \in G_X$ and $r \in \mathbb{C}$. Since addition and scalar multiplication are defined pointwise, S+T and rS belong to $G_X$. The templates O and E are G-templates. Hence, we only need to show that $G_X$ is closed under $\oplus$ since the other properties required of an algebra are inherited from $L_X$. Let $R = S \oplus T$, $x,y \in X$, and $\phi$ be a color preserving automorphism of $D(\mathcal{R})$. By definition of $\oplus$,

\[ r_{\phi(x)}(\phi(y)) = \sum_{z \in \mathcal{T}(\phi(x))} t_{\phi(x)}(z)\, s_z(\phi(y)). \]

Since S and T are G-templates,

\[ \phi^{-1}\big(\mathcal{T}(\phi(x))\big) = \mathcal{T}(x) \]

and

\[ t_{\phi(x)}(z) = t_x(\phi^{-1}(z)) \quad\text{and}\quad s_z(\phi(y)) = s_{\phi^{-1}(z)}(y). \]

Hence

\[ r_{\phi(x)}(\phi(y)) = \sum_{z \in \mathcal{T}(\phi(x))} t_x(\phi^{-1}(z))\, s_{\phi^{-1}(z)}(y)
 = \sum_{z \in \phi^{-1}(\mathcal{T}(\phi(x)))} t_x(z)\, s_z(y)
 = \sum_{z \in \mathcal{T}(x)} t_x(z)\, s_z(y) = r_x(y), \]

which shows that $S \oplus T \in G_X$ and proves the theorem.

Q.E.D.


Definition 3.48. The group algebra of G = X over C, denoted C[G], is defined to be the set $C[G] = \{\, (a(x_0), a(x_1), \ldots, a(x_{n-1})) : a(x_i) \in \mathbb{C} \,\}$ equipped with the usual vector space operations of addition and scalar multiplication; C[G] is also equipped with a vector multiplication defined by a convolution which is tied to the group structure. That is, if

\[ \vec{a} = (a(x_0), a(x_1), \ldots, a(x_{n-1})),\quad \vec{b} = (b(x_0), b(x_1), \ldots, b(x_{n-1})) \in C[G], \]

then $\vec{c} = \vec{a} * \vec{b}$ is defined by

\[ c(x_i) = \sum_{u \in X} a(u)\, b(u^{-1} x_i). \]

Remark 3.49. Note that C[G] is isomorphic to $\mathbb{C}^n$ as a vector space but not necessarily as an algebra.

The following example helps to put the group algebra into perspective.

Example 3.50. Let $G = Z_n$ be the cyclic group of order n written additively. Thus, if $u \in G$, then $u^{-1} = -u$. If $\vec{c} = \vec{a} * \vec{b}$ then

\[ c(x) = \sum_{u \in G} a(u)\, b((x-u) \bmod n). \]

Thus, the group algebra in this case is an algebraic object which models standard cyclic convolution, or equivalently, polynomial multiplication mod $x^n - 1$. If we take G = Z then we obtain standard linear convolution.
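
Example 3.50 can be checked directly from the definition of the product in C[G]. The sketch below is a minimal illustration: the function names and the vectors a and b are hypothetical, and the group is passed in as a multiplication and an inversion function so that the same routine would apply to other finite groups.

```python
def group_convolve(a, b, elements, mul, inv):
    """c(x) = sum over u of a(u) * b(u^{-1} x), with a and b indexed by group element."""
    return {x: sum(a[u] * b[mul(inv(u), x)] for u in elements) for x in elements}


def poly_mul_mod(a, b, n):
    """Coefficients of a(x) b(x) reduced modulo x^n - 1."""
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % n] += ai * bj
    return out


n = 5
Zn = range(n)
a = [1, 2, 0, 3, 0]   # 1 + 2x + 3x^3
b = [4, 0, 1, 0, 0]   # 4 + x^2
c = group_convolve(a, b, Zn, lambda u, v: (u + v) % n, lambda u: (-u) % n)

# (1 + 2x + 3x^3)(4 + x^2) = 4 + 8x + x^2 + 14x^3 + 3x^5, and x^5 = 1.
assert [c[x] for x in Zn] == poly_mul_mod(a, b, n) == [7, 8, 1, 14, 0]
```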

Theorem 3.51. Assume that $(X,\mathcal{R})$ simulates G. $G_X$ is isomorphic, as a ring and as a vector space over C, to C[G].

Proof.

We identify X with G. Recall that a G-template is well defined by defining it at one point. Define $\Gamma : G_X \to C[G]$ by

\[ \Gamma(T) \equiv (a(x_0), a(x_1), \ldots, a(x_{n-1})), \quad\text{where } T(e) = A. \]

The map $\Gamma$ is onto since, by Lemma 3.46, given A we can always define a G-template such that T(e) = A. To see that $\Gamma$ is 1-1 assume that $\Gamma(T) = \Gamma(S)$. Then T(e) = S(e), which implies that S = T. Since addition and scalar multiplication are defined pointwise on both $G_X$ and C[G], $S,T \in G_X$ and $r \in \mathbb{C}$ implies that $\Gamma(S+T) = \Gamma(S) + \Gamma(T)$ and $r\Gamma(T) = \Gamma(rT)$. Let A = S(e), B = T(e), and C = $(S \oplus T)(e)$. By definition of $\oplus$,

\[ c(x) = \sum_{y \in \mathcal{T}(e)} t_e(y)\, s_y(x). \]

Denote $\Gamma(S) * \Gamma(T) = (d(x_0), d(x_1), \ldots, d(x_{n-1}))$. By definition of multiplication in C[G],

\[ d(x) = \sum_{y \in X} a(y)\, b(y^{-1}x). \]

Since $S \in G_X$, $s_y(x) = s_e(\phi_y^{-1}(x)) = s_e(y^{-1}x) = b(y^{-1}x)$. Hence, since $t_e(y) = a(y)$,

\[ c(x) = \sum_{y \in \mathcal{T}(e)} a(y)\, b(y^{-1}x) = \sum_{y \in X} a(y)\, b(y^{-1}x) = d(x), \]

which shows that $\Gamma(S) * \Gamma(T) = \Gamma(S \oplus T)$.

Q.E.D.

Remark 3.52. In the case that $G = Z_m \times Z_n$, $\Gamma = \Gamma_0$ defined in Section 3.2.

Group algebras arise in the study of linear representations of groups. These representations furnish another description of the set of G-templates. They are also a source of necessary and sufficient conditions for determining the invertibility of G-templates.

Definition 3.53. A linear representation of a group G over C is a group homomorphism $\alpha : G \to M_k(\mathbb{C})$ for some $k \ge 1$. A linear representation of the group algebra C[G] over C is an algebra homomorphism $\alpha : C[G] \to M_k(\mathbb{C})$; that is, $\alpha$ preserves sums, products, and scalar multiplications.

Denote $G = \{ x_0, x_1, \ldots, x_{n-1} \}$. Note that for each $k \in [0,n-1]$ we can define a permutation $r_k \in S_n$ by the rule $r_k(i) = j$ if and only if $x_i x_k = x_j$. Henceforth, we use the symbols $r_k$ to denote these particular permutations.

Definition 3.54. A right regular representation of G is a representation $\alpha : G \to M_n(\mathbb{C})$ of the form $\alpha(x_k) = P_{r_k}$. A right regular representation of C[G] is a representation $\alpha^* : C[G] \to M_n(\mathbb{C})$ of the form

\[ \alpha^*\big((a(x_0), a(x_1), \ldots, a(x_{n-1}))\big) = \sum_{i=0}^{n-1} a(x_i)\,\alpha(x_i) = \sum_{i=0}^{n-1} a(x_i)\,P_{r_i} \]

where $\alpha$ is a right regular representation of G.

It is a fact that right regular representations are isomorphisms. Note that there is a different right regular representation of G for every different ordering of G. Left regular representations defined relative to different orderings are equivalent up to permutations [19].

Remark 3.55. Any representation $\alpha$ of G can be extended to a representation $\alpha^*$ of C[G] in a fashion similar to that in Definition 3.54. In fact, G can be considered to be embedded in C[G] via the mapping $x_i \to \vec{e}_{i+1}$ where $\vec{e}_{i+1}$ represents the (i+1)th element of the standard basis. Hence, we can define $\alpha^*(\vec{e}_{i+1}) = \alpha(x_i)$ and then extend by linearity. It can be shown that the multiplication is preserved [27]. Henceforth, we use the same name for either representation, that is, we take $\alpha^* = \alpha$.

Theorem 3.56. Let $X = \{ x_0, x_1, \ldots, x_{n-1} \}$ be any ordering of X and let $\Psi : L_X \to M_n(\mathbb{C})$ be defined relative to this ordering. Let $\vec{a} \in C[G]$ and let $T_{\vec{a}}$ denote the G-template associated with $\vec{a}$, that is, $T_{\vec{a}} = \Gamma^{-1}(\vec{a})$, and $M_{\vec{a}}$ the value of the right regular representation at $\vec{a}$, that is, $M_{\vec{a}} = \alpha(\vec{a})$, where $\alpha$ is the right regular representation of C[G]. Then $\Psi(T_{\vec{a}}) = M_{\vec{a}}$, that is, $\Psi = \alpha\Gamma$.

Proof.

Let $A = T_{\vec{a}}(e)$. Then $t_{x_i}(x_j) = t_e(\phi_{x_i}^{-1}(x_j)) = a(x_i^{-1}x_j)$. Hence, $\Psi(T_{\vec{a}}) = (\mu_{ij})$ where $\mu_{ij} = a(x_i^{-1}x_j)$. Denote $M_{\vec{a}} = (\gamma_{ij})$ and $P_{r_k} = (p_{ij}^{(k)})$ for $k \in [0,n-1]$. Assume that $p_{ij}^{(k)} = 1$ for some triple (i,j,k). Then $j = r_k(i)$ or $x_i x_k = x_j$. Furthermore, if $p_{ij}^{(h)} = 1$, then $x_i x_h = x_j$ so $x_h = x_i^{-1}x_j = x_k$, which implies that h = k. Thus, by definition of the right regular representation of C[G], each $\gamma_{ij}$ is of the form $a(x_k)$ for some k. In fact, $\gamma_{ij} = a(x_k)$ if and only if $p_{ij}^{(k)} = 1$, which in turn is true if and only if $x_i x_k = x_j$. Thus, $\gamma_{ij} = a(x_i^{-1}x_j) = t_{x_i}(x_j) = \mu_{ij}$, which implies that $M_{\vec{a}} = \Psi(T_{\vec{a}})$, which is the desired result.

Q.E.D.

An alternative statement of Theorem 3.56 is that the diagram in Figure 3.3 commutes if $\Psi$ and $\alpha$ are computed relative to the same ordering of G.

[Figure: commutative triangle with $\Psi : G_X \to M_n(\mathbb{C})$, $\Gamma : G_X \to C[G]$, and $\alpha : C[G] \to M_n(\mathbb{C})$.]

Figure 3.3. Commutative diagram representing Theorem 3.56.
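
Theorem 3.56 can be verified entry by entry for a small nonabelian group. The sketch below uses the symmetric group on three letters; the choice of group, the ordering of its elements, the vector a, and all function names are assumptions made for illustration.

```python
from itertools import permutations

elements = sorted(permutations(range(3)))   # x_0 = (0, 1, 2) is the identity
index = {g: i for i, g in enumerate(elements)}
n = len(elements)


def mul(p, q):
    """Composition of permutations: (p q)(i) = p[q[i]]."""
    return tuple(p[q[i]] for i in range(3))


def inv(p):
    out = [0] * 3
    for i, v in enumerate(p):
        out[v] = i
    return tuple(out)


def perm_matrix(k):
    """P_{r_k}: entry (i, j) is 1 exactly when x_i x_k = x_j."""
    return [[1 if mul(elements[i], elements[k]) == elements[j] else 0
             for j in range(n)] for i in range(n)]


def regular_rep(a):
    """alpha(a) = sum over k of a(x_k) P_{r_k}."""
    mats = [perm_matrix(k) for k in range(n)]
    return [[sum(a[k] * mats[k][i][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def template_matrix(a):
    """Psi(T_a): entry (i, j) is t_{x_i}(x_j) = a(x_i^{-1} x_j)."""
    return [[a[index[mul(inv(elements[i]), elements[j])]]
             for j in range(n)] for i in range(n)]


def matmul(A, B):
    return [[sum(A[i][h] * B[h][j] for h in range(n)) for j in range(n)]
            for i in range(n)]


a = [2, -1, 0, 5, 3, 7]
assert regular_rep(a) == template_matrix(a)
```

The helper matmul is included so that the homomorphism property of the matrices $P_{r_k}$ can also be checked for any pair of group elements.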

Definition 3.57. We say that $T \in L_X$ is an elementary template if and only if for every $x,y \in X$, $t_x(y) \in \{0,1\}$.

Lemma 3.58. Assume that $T \in G_X$ is an elementary template. The sets $\{\, \mathcal{T}(x) : x \in X \,\}$ are all translates of some set $S \subset X$, that is, there exists $S \subset X$ such that for every $x \in X$, $xS = \mathcal{T}(x)$ where $xS = \{\, xs : s \in S \,\}$. Furthermore, for every $S \subset X$, there exists a unique elementary template $T \in G_X$ such that for every $x \in X$, $xS = \mathcal{T}(x)$. We say that T (or S) is induced by S (or T).

Proof.

Define $S = \mathcal{T}(e)$ and let $x \in X$. Then

\[ xS = \{\, y \in X : y = xs \text{ for some } s \in S \,\}
 = \{\, y \in X : x^{-1}y \in \mathcal{T}(e) \,\}
 = \{\, y \in X : t_e(x^{-1}y) \neq 0 \,\}
 = \{\, y \in X : t_x(y) \neq 0 \,\} = \mathcal{T}(x). \]

The converse follows immediately from Lemma 3.46.

Q.E.D.


Remark 3.59. The image T(e) is the graph of the characteristic function of S. Hence, by Theorem 3.56, $\Psi(T) = \sum_{s \in S} \alpha(s)$ where $\alpha$ denotes the right regular representation of G.

Remark 3.60. Such a set S exists for any G-template.

We now give the definition of the discrete Radon transform.

Definition 3.61. Let G be a finite group and let $S \subset G$. For every $f : G \to \mathbb{C}$ define $\bar{f} : G \to \mathbb{C}$ by

\[ \bar{f}(t) = \sum_{s \in tS} f(s). \]

The transform $f \to \bar{f}$ is called the discrete Radon transform based on S.

Theorem 3.62. Assume that $(X,\mathcal{R})$ simulates G. Let $S \subset G$ and let $T \in G_X$ be induced by S. The discrete Radon transform based on S is invertible if and only if T is.

Proof.

We identify X with G. Let $f : X \to \mathbb{C}$ and let F be the graph of f. By definition of the discrete Radon transform,

\[ \bar{f}(x) = \sum_{s \in xS} f(s) = \sum_{s \in \mathcal{T}(x)} f(s). \]

Hence, if we denote by $\bar{F}$ the graph of $\bar{f}$, then we have that $\bar{F} = F \oplus T$. Therefore the discrete Radon transform based on S is invertible if and only if the mapping $F \to F \oplus T$ is invertible. This will be the case if and only if T is invertible.

Q.E.D.
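
For a cyclic group the transform of Definition 3.61 is a short sum, and both an invertible and a non-invertible case are easy to exhibit. The sketch below assumes $G = Z_n$ written additively, so that tS = { t + s : s in S }; the function name and test data are hypothetical.

```python
def radon(f, S, n):
    """Discrete Radon transform on Z_n: fbar(t) = sum of f over the translate t + S."""
    return [sum(f[(t + s) % n] for s in S) for t in range(n)]


# On Z_4 with S = {0, 1} a nonzero function is sent to zero, so the transform
# based on S is not invertible.
assert radon([1, -1, 1, -1], {0, 1}, 4) == [0, 0, 0, 0]

# On Z_3 with the same S, f can be recovered from fbar:
fbar = radon([1, 2, 3], {0, 1}, 3)
recovered = [(fbar[t] - fbar[(t + 1) % 3] + fbar[(t + 2) % 3]) // 2 for t in range(3)]
assert fbar == [3, 5, 4] and recovered == [1, 2, 3]
```

The recovery formula in the second example is specific to n = 3 and S = {0, 1}; it is shown only to make the invertibility concrete.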

The following definitions are expressed in terms of group representations. They can just as easily be made for representations of group algebras and we will consider this to be done.

Definition 3.63. Let $\alpha : G \to M_k(\mathbb{C})$ be a representation of G. We say that $\alpha$ is reducible if and only if there exists a nonsingular matrix $M \in M_k(\mathbb{C})$ such that for every $x \in G$

\[ \alpha(x) = M \begin{pmatrix} A(x) & B(x) \\ 0 & C(x) \end{pmatrix} M^{-1}. \]

If $\alpha$ is not reducible, then we say that $\alpha$ is irreducible.

Definition 3.64. We say that two representations $\alpha$ and $\beta$ of G are similar if and only if there exists an invertible matrix $M \in M_k(\mathbb{C})$ such that for every $x \in G$, $\alpha(x) = M\beta(x)M^{-1}$.

Fact 3.65. If $\alpha : G \to M_n(\mathbb{C})$ is a right regular representation of G, then there exists a nonsingular matrix M and irreducible representations $\alpha_1, \alpha_2, \ldots, \alpha_k$ of G such that for every $g \in G$

\[ \alpha(g) = M \begin{pmatrix} \alpha_1(g) & & & 0 \\ & \alpha_2(g) & & \\ & & \ddots & \\ 0 & & & \alpha_k(g) \end{pmatrix} M^{-1}. \]

Moreover, if $\rho$ is any irreducible representation then $\rho$ is similar to $\alpha_i$ for some i [27, 29]. The same statement holds true for C[G].

Theorem 3.66. Let $S \subset G$ and let $T \in G_X$ be induced by S. The discrete Radon transform based on S is invertible if and only if $\rho(\Gamma(T))$ is invertible for every irreducible representation $\rho$ of C[G].

Proof.

Let $\Gamma(T) = \vec{p}$. Assume that the discrete Radon transform based on S is invertible and let $\rho$ be any irreducible representation of C[G]. By Theorem 3.62, T is invertible. Let $\alpha$ be a right regular representation of G and let $\Psi : L_X \to M_n(\mathbb{C})$ be defined relative to the same ordering of G as $\alpha$. By Theorem 3.56, $\alpha(\vec{p}) = \Psi(T)$, which implies that $\alpha(\vec{p})$ is invertible. By Fact 3.65, there exists a nonsingular matrix $M \in M_n(\mathbb{C})$, and irreducible representations $\alpha_1, \alpha_2, \ldots, \alpha_k$ of C[G] such that

\[ \Psi(T) = M \begin{pmatrix} \alpha_1(\vec{p}) & & & 0 \\ & \alpha_2(\vec{p}) & & \\ & & \ddots & \\ 0 & & & \alpha_k(\vec{p}) \end{pmatrix} M^{-1}. \]

Furthermore, there exists an $i \in [1,k]$ such that $\rho$ is similar to $\alpha_i$. Since $\Psi(T)$ is invertible if and only if every $\alpha_i(\vec{p})$ is, it follows that $\rho(\vec{p}) = \rho(\Gamma(T))$ is invertible.

Conversely, assume that $\rho(\Gamma(T))$ is invertible for every irreducible representation $\rho$ of C[G]. Then, by Fact 3.65, $\Psi(T)$ is invertible. Hence, T is invertible so by Theorem 3.62, the discrete Radon transform based on S is invertible.

Q.E.D.

Note that the proof of Theorem 3.66 yields the more general result given in the next corollary.

Corollary 3.67. Let $T \in G_X$. T is invertible if and only if $\rho(\Gamma(T))$ is invertible for every irreducible representation $\rho$ of C[G].

Definition 3.68. If $\rho$ is a representation of G, then the contragredient representation of $\rho$ is the representation defined by $\hat{\rho}(x) \equiv \rho(x^{-1})^t$. Note that $\rho(x) = \hat{\rho}(x^{-1})^t$ and that every representation is the contragredient representation of some representation.


Lemma 3.69. Let $S \subset G$, let $T \in G_X$ be induced by S, and let $\rho$ be any irreducible representation of C[G]. The matrix $\rho(\Gamma(T))$ is invertible if and only if $\sum_{s \in S} \hat{\rho}(s^{-1})$ is.

Proof.

Let $a = \Gamma(T)$. By Remark 3.56, we can write $\rho(a) = \sum_{i=0}^{n-1} a(x_i)\rho(x_i)$. Furthermore, since $a : X \to \mathbb{C}$ is the characteristic function of S, $\rho(a) = \sum_{s \in S} \rho(s)$. Thus, $\rho(\Gamma(T)) = \rho(a)$ is invertible if and only if $\sum_{s \in S} \rho(s)$ is invertible. Moreover, $\sum_{s \in S} \rho(s)$ is invertible if and only if $\sum_{s \in S} \hat{\rho}(s^{-1})^t = \bigl(\sum_{s \in S} \hat{\rho}(s^{-1})\bigr)^t$ is. Since a matrix is invertible if and only if its transpose is invertible, we may conclude that the lemma is true.

Q.E.D.


Corollary 3.70. Let $S \subset G$. The discrete Radon transform based on S is invertible if and only if for every irreducible representation $\rho$ of G, the matrix $\sum_{s \in S} \rho(s^{-1})$ is invertible.

Proof.

By Theorem 3.66, the discrete Radon transform based on S is invertible if and only if for every irreducible representation $\rho$ of G the matrix $\rho(\Gamma(T))$ is invertible. By Lemma 3.69, this will happen if and only if $\sum_{s \in S} \hat{\rho}(s^{-1})$ is invertible. Since every irreducible representation is the contragredient of some irreducible representation, the corollary follows.

Q.E.D.

Corollary  3.70  was  first  proved  by  Diaconis  and  Graham  using  different  methods 
[11]. 
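
For a cyclic group the criterion of Corollary 3.70 reduces to a finite check on characters, which the following minimal sketch illustrates. It assumes G = Z_n written additively, so that the irreducible representations are the characters j -> omega_n^(j s) and the inverse of s is -s mod n; the function name and the tolerance are illustrative choices, not part of the original text.

```python
import cmath

def radon_invertible_on_cyclic_group(n, subset, tol=1e-9):
    """Character test on Z_n: the sum over s in S of chi_j(-s) must be nonzero for every j."""
    w = cmath.exp(-2j * cmath.pi / n)
    for j in range(n):
        # chi_j(s^{-1}) = w**(j * (-s mod n)) is the j-th one-dimensional representation
        total = sum(w ** (j * ((-s) % n)) for s in subset)
        if abs(total) < tol:
            return False
    return True
```

For example, S = {0, 1} passes the test on Z_3 but fails on Z_4, where the character with j = 2 gives 1 + (-1) = 0.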

Thus, the invertibility of the discrete Radon transform is linked to the invertibility of G-templates. In the case of circulant templates the criterion is particularly useful since the irreducible representations of abelian groups are all one-dimensional. In fact, the irreducible representations of $Z_m \times Z_n$ over C are the mn elements of the set $\{\omega_m^i \omega_n^j : i,j \in Z\}$. Hence, if $\rho$ is an irreducible representation of C[G], then

\[
\rho(\Gamma(T)) = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} t_0(i,j)\, \omega_m^{ik} \omega_n^{jl} = p_t(\omega_m^k, \omega_n^l, 0)
\]

for some integers k and l. Recall from Section 3.2 that the block circulant matrices with circulant blocks are all simultaneously diagonalizable by the Fourier matrices. This is a special case of Fact 3.65. By the aforementioned observations and Theorem 3.56, the numbers $p_t(\omega_m^i, \omega_n^j, 0)$ must be the eigenvalues of the matrix $\Psi(T)$, which is block circulant with circulant blocks.

It  is  interesting  to  point  out  that  using  the  above  criteria  for  circulant  templates 
can  yield  conditions  for  the  invertibility  of  some  particular  templates  very  easily  whereas 
for  others  it  is  not  so  easy.  For  example,  if  we  let  M  denote  the  Moore  configuration, 
which  is  just  a  3  x  3  square  at  every  point,  then  one  can  use  a  few  trigonometric  mani- 
pulations to  show  that  M  is  invertible  on  the  n  x  n  array  X  if  and  only  if  3  does  not 
divide  n.  If  one  considers  the  von  Neumann  configuration,  however,  it  does  not  appear  to 
be  as  easy.  A  computer  program  has  been  written  which  suggests  that  the  discrete 
Radon  transform  based  on  the  von  Neumann  configuration  on  X  is  invertible  if  and  only 
if  5  divides  n.    We  have  not  been  able  to  prove  the  necessity  of  this  conjecture. 
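
The eigenvalue criterion can be explored numerically for small arrays. The sketch below assumes an n x n toroidal array with n >= 3 and evaluates, for every pair of characters, the sum of the character values over the neighborhood offsets; because both neighborhoods are symmetric under negation, that sum is a sum of cosines. The names `MOORE`, `VON_NEUMANN`, `min_abs_eigenvalue`, and `is_invertible` are hypothetical helpers introduced for illustration.

```python
import math

MOORE = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
VON_NEUMANN = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

def min_abs_eigenvalue(n, offsets):
    """Smallest |eigenvalue| of the 0/1 circulant template with the given offsets."""
    best = float("inf")
    for i in range(n):
        for j in range(n):
            # symmetric offsets: imaginary parts cancel, leaving a sum of cosines
            value = sum(math.cos(2 * math.pi * (i * dx + j * dy) / n)
                        for dx, dy in offsets)
            best = min(best, abs(value))
    return best

def is_invertible(n, offsets, tol=1e-9):
    """True when no eigenvalue is numerically zero."""
    return min_abs_eigenvalue(n, offsets) > tol
```

A tolerance is needed because exact zeros appear only up to rounding; for instance 1 + 2cos(2*pi/5) + 2cos(4*pi/5) = 0 for the von Neumann offsets when n = 5. For the Moore offsets the eigenvalues factor as (1 + 2cos a)(1 + 2cos b), which is the trigonometric manipulation behind the divisibility-by-3 condition.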

As  we  mentioned  previously,  G-templates  have  potential  applications  to  parallel 
processing  since  they  are  translation  invariant  with  respect  to  Cayley  color  graphs  which 
are  being  studied  as  possible  models  for  parallel  computer  architectures.  As  mentioned 
in  the  Introduction,  many  parallel  computer  architectures  are  built  to  implement  transla- 
tion invariant  transformations  since  it  is  generally  easier  to  construct  and  control  such 
devices.  It  is  reasonable  to  expect  that  G-templates  will  furnish  a  good  algebraic  descrip- 
tion of  computing  environments  based  on  Cayley  networks. 
The topic of group representations has been in existence since about 1896 when
Frobenius  first  defined  group  characters  [29].  It  is  a  well  studied  area  and  has  seen 
applications  to  physics  and  chemistry  as  well  as  in  group  theory.  It  is  fascinating  to 
think  that  a  subject  developed  in  the  early  part  of  the  century  could  have  applications 
to  parallel  computer  architectures.  It  is  possible  that  this  will  be  another  case  where  the 
mathematics  precedes  the  application  by  a  significant  time  period. 

We  have  defined  G-templates  and  shown  that  they  are  generalizations  of  circulant 
templates.  We  have  also  shown  that  the  set  of  all  G-templates  is  an  algebra  which  is  iso- 
morphic to  the  group  algebra  of  G  over  C.  We  then  described  the  relationship  between 
G-templates  and  the  discrete  Radon  transform. 


PART  II 
LOCAL  DECOMPOSITIONS  OF  FOURIER  TEMPLATES 


In Part II we focus on deriving local decompositions of the Fourier template with respect to the von Neumann configuration. We recall that the von Neumann configuration is the template configuration function $V : X \to 2^X$ defined on the m x n array X by V(x,y) = {(x,y), (x+1,y)(mod (m,n)), (x-1,y)(mod (m,n)), (x,y+1)(mod (m,n)), (x,y-1)(mod (m,n))}. V models a mesh-connected, or rectangular, array of processors with nearest neighbor connections. Henceforth, when we refer to local we shall mean with respect to V. We shall also take X to be an m x n array where m and n are positive integers but are not restricted in any other way.
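
As a concrete reading of the definition, the following sketch computes V(x,y) on an m x n array. It assumes that "(mod (m,n))" means the first coordinate is reduced modulo m and the second modulo n; the function name is illustrative.

```python
def von_neumann(x, y, m, n):
    """Return V(x, y): the point itself and its four wrap-around neighbors."""
    return {
        (x, y),
        ((x + 1) % m, y),
        ((x - 1) % m, y),
        (x, (y + 1) % n),
        (x, (y - 1) % n),
    }
```

On a 4 x 5 array the corner point (0,0) has neighbors (1,0), (3,0), (0,1), and (0,4), which shows the wrap-around connections of the mesh.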

We use matrix algebra associated with the fast Fourier transform (FFT) and numerical linear algebra to derive local decompositions of the Fourier matrices. Due to the existence of the isomorphism $\Psi : L_X \to M_{mn}(\mathbb{C})$, these decompositions can be interpreted as decompositions of the corresponding templates.

We say that a template $T \in L_X$ is separable if $M_T = M_1 \otimes M_2 = (M_1 \otimes I_n)(I_m \otimes M_2)$ where $M_1$ and $M_2$ are m x m and n x n matrices respectively. We have already observed that the Fourier template has this property. This fact implies that it is sufficient (but not necessary) to seek decompositions which are local with respect to restrictions of the von Neumann configuration to linear arrays. If T is separable, then $T = T_2 \oplus T_1$ where $T_2 = \Psi^{-1}(I_m \otimes M_2)$ and $T_1 = \Psi^{-1}(M_1 \otimes I_n)$. If $A \in \mathbb{C}^X$, then the mapping $A \to A \oplus T_2$ represents an identical computation being done on each row and $(A \oplus T_2) \oplus T_1$ represents an identical computation being done along each column of the result. Thus, it is sufficient to construct decompositions of $T_1$ and $T_2$ with respect to the configurations $R_1$ and $R_2$ defined by

R_1(x,y) = { (x,y), (x+1,y)(mod (m,n)), (x-1,y)(mod (m,n)) }
and

R_2(x,y) = { (x,y), (x,y+1)(mod (m,n)), (x,y-1)(mod (m,n)) }.
This  can  be  achieved  by  factoring  the  matrices  M1  and  M2  into  products  of  tridiagonal 
matrices.  For  this  reason,  we  concern  ourselves  with  developing  decompositions  of  the 
one-dimensional  Fourier  matrices  into  products  of  tridiagonal  matrices  and  permutation 
matrices.  We  call  such  decompositions  tridiagonal  decompositions.  We  will  describe  an 
algorithm  for  computing  the  permutations  locally.  Henceforth,  we  shall  refer  to  the 
one-dimensional  Fourier  matrices  simply  as  Fourier  matrices. 
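
The row-then-column interpretation of a separable transform can be checked on a small example. The sketch below assumes the image is stored row by row, so that the Kronecker product M1 (x) M2 acts on the flattened image; all function names are illustrative, and plain nested lists stand in for matrices.

```python
def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

def apply_separable(M1, M2, image):
    """Apply M2 to every row, then M1 along every column of the result."""
    after_rows = [matvec(M2, row) for row in image]
    columns = [list(col) for col in zip(*after_rows)]
    after_cols = [matvec(M1, col) for col in columns]
    return [list(row) for row in zip(*after_cols)]

def apply_kronecker(M1, M2, image):
    """Apply the full (M1 (x) M2) matrix to the row-major flattened image."""
    flat = [value for row in image for value in row]
    out = matvec(kron(M1, M2), flat)
    n = len(image[0])
    return [out[r * n:(r + 1) * n] for r in range(len(image))]
```

Both routes give the same result, but the separable route only ever combines values within a single row or a single column, which is why one-dimensional decompositions suffice.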

Part  II  is  divided  into  two  chapters.  In  the  first  chapter,  chapter  4,  we  use  matrix 
identities  associated  with  FFTs  to  develop  tridiagonal  decompositions  of  the  Fourier 
matrices.  In  the  second  chapter,  chapter  5,  we  develop  a  numerical  method  for  comput- 
ing tridiagonal  decompositions  of  the  Fourier  matrices. 


CHAPTER  4 

TRIDIAGONAL  DECOMPOSITIONS  OF  FOURIER  MATRICES  BASED  ON 

MATRIX  ALGEBRA  ASSOCIATED  WITH  THE  FFT 


In  this  chapter,  we  use  matrix  identities  associated  with  FFTs  to  develop  tridiago- 
nal  decompositions  of  Fourier  matrices.  We  describe  methods  for  decreasing  the  number 
of  parallel  operations  and  mapping  "large"  DFTs  onto  smaller  arrays  using  two- 
dimensional  techniques.  We  investigate  the  parallel  time  complexity  of  the  algorithms 
implied  by  the  decompositions. 

In  the  first  section,  we  use  identities  developed  by  Rose  [48]  to  derive  decomposi- 
tions of  Fn  into  products  of  permutation  matrices  and  block  diagonal  matrices  in  two 
different  ways.  The  diagonal  blocks  of  these  matrices  are  of  the  form  Fp  where  p  is  a 
prime  divisor  of  n.  We  then  derive  a  matrix  identity  associated  with  the  Rader  prime 
algorithm  and  the  circular  convolution  theorem  [6,  10,  31].  This  identity  is  used  to  fac- 
tor the  Fourier  matrices  of  prime  order  into  products  of  permutation  matrices,  Fourier 
matrices  of  lower,  composite  order,  and  certain  sparse  triangular  matrices.  We  show 
how  the  sparse  matrices  that  appear  can  be  factored.  Taken  together,  these  theorems 
provide  an  algorithm  for  deriving  tridiagonal  decompositions  of  Fn  for  arbitrary  n. 

In the second section, we derive expressions for the number of parallel arithmetic steps and upper bounds on the number of parallel permutation, or data manipulation, steps required to implement the algorithms resulting from the decompositions. We provide a table giving upper bounds on the number of steps required to implement these algorithms on a mesh-connected array for certain image dimensions. Since many permutation matrices appear in the decompositions, data manipulation steps represent a significant part of any algorithm derived from these decompositions. Data manipulation steps are important when trying to implement algorithms on massively parallel architectures since the processors do not have a global shared memory. Therefore, particular attention is paid to this aspect of the parallel time complexity.

4.1.  Derivation  of  FFT-Based  Decompositions 

We first establish notation and state identities which will be required in deriving the tridiagonal decompositions based on FFT identities. Unless otherwise stated we consider matrices as ordered from 0 to n-1. We denote by $S_n$ the group of permutations of n objects, and by $Z_n$ the ring of integers modulo n. We think of $S_n$ as acting on $Z_n$. Let $m \in Z$ with m > 1 and let $\omega_m = \exp(-2\pi i/m)$ where $i = (-1)^{1/2}$. Assume for the remainder of the section that n = mk where m,k > 1.

Definition 4.1. Define $\sigma_{m,k} : Z_n \to Z_n$ by the following rule:

If $i \in Z_n$ and $i = a + bm$ with $0 \le a < m$ and $0 \le b < k$, then $\sigma_{m,k}(i) = ak + b$.
$\sigma_{m,k}$ is called a shuffle permutation.

Definition 4.2. Define $P(m,k) \in M_n$ by $P(m,k) = P_{\sigma_{m,k}}$.

P(m,k) is called a shuffle permutation matrix. Note that $P(m,k) = P(k,m)^t = P(k,m)^{-1}$.
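
The shuffle permutation is easy to tabulate, and doing so makes the inverse relationship between the (m,k) and (k,m) shuffles visible. The following sketch represents the permutation of Definition 4.1 as a Python list; the function names are illustrative.

```python
def shuffle(m, k):
    """sigma_{m,k} as a list: the index i = a + b*m is sent to a*k + b."""
    return [(i % m) * k + (i // m) for i in range(m * k)]

def compose(first, second):
    """Return the permutation i -> second[first[i]]."""
    return [second[first[i]] for i in range(len(first))]
```

For m = 2 and k = 3 the shuffle is [0, 3, 1, 4, 2, 5], and composing it with the (3,2) shuffle returns the identity, which is the permutation-level form of P(m,k) = P(k,m)^{-1}.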

Definition 4.3. Define the n x n diagonal matrix D(m,k) by

\[ D(m,k) = \mathrm{blockdiag}[D_j(m,k)], \quad j = 0,1,\ldots,k-1, \]
where

\[ D_j(m,k) = \mathrm{diag}(\omega_n^{ij}), \quad i = 0,1,\ldots,m-1, \quad j = 0,1,\ldots,k-1. \]
These matrices are called twiddle matrices or twiddle factors.

For any p,k > 1 we denote $D_{p^k} = D(p^{k-1},p)$ and $P_{p^k} = P(p^{k-1},p)$. For convenience, we take $P(n,1) = I_n$, the n x n identity matrix, and $I_1 = 1$.

It is useful to have other descriptions of the twiddle matrices. We give two more here; the first is useful for expressing matrix identities whereas the second is better for implementation purposes.

Definition 4.4. Let A be any m x m matrix and define $T_k(A)$ by

\[ T_k(A) = \mathrm{blockdiag}[A^0, A^1, A^2, \ldots, A^{k-1}]. \]
Let

\[ \Omega_m = \mathrm{diag}(1, \omega_n, \omega_n^2, \ldots, \omega_n^{m-1}) \]
and define

\[ \widetilde{D}(m,k) = T_k(\Omega_m). \]

Lemma 4.5. $D(m,k) = \widetilde{D}(m,k)$.

Proof.

The lemma follows immediately from the fact that $D_j(m,k) = \Omega_m^j$.

Q.E.D.

Lemma 4.6. If $D(m,k) = \mathrm{diag}(d_i)$, then $d_i = \omega_n^{rq}$, where r and q are the unique integers satisfying i = qm + r with $0 \le r < m$, that is, r = i (mod m) and q = (i - r)/m.

Proof.

By Lemma 4.5, $d_i$ is the (r,r) entry of the diagonal matrix $\Omega_m^q = \mathrm{diag}(1, \omega_n^q, \omega_n^{2q}, \ldots, \omega_n^{(m-1)q})$. Thus, $d_i = \omega_n^{rq}$.
We now state some basic identities concerning Fourier matrices, Kronecker products, and permutation matrices which were developed by Rose [48]. We will use these identities as building blocks for some of our derivations.

Let $C_n$ denote the n x n circulant permutation matrix

\[ C_n = \mathrm{circ}(0,0,\ldots,0,1). \]
For any integer s, let Q(m,k,s) denote the n x n permutation matrix

\[ Q(m,k,s) \equiv P(k,m)\, T_k(C_m^s)\, P(m,k). \]
Facts.

1.) If A is an n x n matrix and B is an m x m matrix, then $P(n,m)(A \otimes B)P(m,n) = B \otimes A$.
2.) (General Radix Identity)

\[ F_n = (F_m \otimes I_k)\, D(k,m)\, (I_m \otimes F_k)\, P(k,m) \]

3.) (Twiddle Free Identity) Assume that m and k are relatively prime and let $m^* = m^{-1} \pmod{k}$ and $k^* = k^{-1} \pmod{m}$. Then

\[ F_n = Q(m,k,-k^*)\, (F_m \otimes F_k)\, T_m(C_k^{m^*})\, P(k,m). \]
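
The general radix identity can be verified numerically for small m and k. The sketch below is only an illustration under stated assumptions: it uses the unnormalized Fourier matrix, and it assumes the convention that the permutation matrix of a permutation sigma acts by (P x)[i] = x[sigma(i)], which is the convention under which the identity checks out with Definitions 4.1 to 4.3 as written. All function names are hypothetical.

```python
import cmath

def dft(n):
    """Unnormalized Fourier matrix F_n with entries omega_n**(r*c)."""
    return [[cmath.exp(-2j * cmath.pi * r * c / n) for c in range(n)] for r in range(n)]

def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def matmul(A, B):
    return [[sum(A[r][t] * B[t][c] for t in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]

def shuffle_matrix(m, k):
    """P(m,k) under the assumed convention (P x)[i] = x[sigma_{m,k}(i)]."""
    n = m * k
    P = [[0] * n for _ in range(n)]
    for i in range(n):
        P[i][(i % m) * k + i // m] = 1
    return P

def twiddle(m, k):
    """D(m,k): k diagonal blocks of size m, block j holding omega_n**(i*j)."""
    n = m * k
    d = [cmath.exp(-2j * cmath.pi * i * j / n) for j in range(k) for i in range(m)]
    return [[d[r] if r == c else 0 for c in range(n)] for r in range(n)]

def general_radix(m, k):
    """Right-hand side (F_m (x) I_k) D(k,m) (I_m (x) F_k) P(k,m)."""
    left = matmul(kron(dft(m), identity(k)), twiddle(k, m))
    right = matmul(kron(identity(m), dft(k)), shuffle_matrix(k, m))
    return matmul(left, right)

def max_error(m, k):
    """Largest entrywise difference between F_{mk} and the factored form."""
    F, G = dft(m * k), general_radix(m, k)
    return max(abs(F[r][c] - G[r][c]) for r in range(m * k) for c in range(m * k))
```

With the normalized Fourier matrices the same check goes through, since the scale factors of F_m and F_k multiply to that of F_n.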

We  now  use  these  identities  to  derive  decompositions  of  the  Fourier  matrices  into 
products  of  permutation  matrices  and  block  diagonal  matrices,  the  blocks  of  which  are 

Fourier  matrices  of  prime  order.    We  first  assume  that  n  is  not  a  power  of  a  prime. 

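Theorem 4.7. If $n = \prod_{i=1}^{s} p_i^{k_i}$ with $s \ge 2$ and

\[
n_j = \begin{cases} \prod_{i=1}^{j} p_i^{k_i} & \text{if } 1 \le j \le s \\ 1 & \text{if } j = -1, 0, \end{cases}
\]

$c_j = n/n_j$ for $-1 \le j \le s$, $q_j = p_j^{k_j}$ for $1 \le j \le s$, and $q_0 = 1$, then there exist permutation matrices $Q_1, Q_2, \ldots, Q_{2s-2}$ such that

\[
F_n = \left[ \prod_{j=0}^{s-2} (I_{n_j} \otimes Q_{2j+1})(I_{n_j} \otimes F_{q_{j+1}} \otimes I_{c_{j+1}}) \right]
(I_{n_{s-1}} \otimes F_{q_s})
\left[ \prod_{j=1}^{s-1} (I_{n_{s-j-1}} \otimes Q_{2(s-j)}) \right].
\]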
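Proof.

The proof is by induction on the number of primes s.

Assume s = 2. Then, since $c_1 = q_2$, $n_1 = q_1$, and $n_0 = 1$, by Fact 3.), there exist permutation matrices $Q_1$ and $Q_2$, namely $Q_1 = Q(q_1, q_2, -q_2^*)$ and $Q_2 = T_{q_1}(C_{q_2}^{q_1^*}) P(q_2, q_1)$, such that $F_n = Q_1 (F_{q_1} \otimes I_{q_2})(I_{q_1} \otimes F_{q_2}) Q_2$, which is exactly the statement of the theorem.

Assume that the theorem is true for all integers with prime factorization of length less than s. In particular, the theorem is true for $c_1$. Thus, letting $\hat{n}_j = \prod_{i=2}^{j} p_i^{k_i}$ and $\hat{n}_1 = 1$, we have that there exist $Q_3, \ldots, Q_{2s-3}, Q_4, \ldots, Q_{2s-2}$ such that

\[
F_{c_1} = \left[ \prod_{j=1}^{s-2} (I_{\hat{n}_j} \otimes Q_{2j+1})(I_{\hat{n}_j} \otimes F_{q_{j+1}} \otimes I_{c_{j+1}}) \right]
(I_{\hat{n}_{s-1}} \otimes F_{q_s})
\left[ \prod_{j=1}^{s-2} (I_{\hat{n}_{s-j-1}} \otimes Q_{2(s-j)}) \right].
\]

Moreover, there exist $Q_1$ and $Q_2$ such that

\[ F_n = Q_1 (F_{q_1} \otimes I_{c_1})(I_{q_1} \otimes F_{c_1}) Q_2. \]
Putting these equations together and observing that $\hat{n}_j q_1 = n_j$ yields

\[
\begin{aligned}
F_n &= Q_1 (F_{q_1} \otimes I_{c_1})
\left[ \prod_{j=1}^{s-2} (I_{n_j} \otimes Q_{2j+1})(I_{n_j} \otimes F_{q_{j+1}} \otimes I_{c_{j+1}}) \right]
(I_{n_{s-1}} \otimes F_{q_s})
\left[ \prod_{j=1}^{s-2} (I_{n_{s-j-1}} \otimes Q_{2(s-j)}) \right] Q_2 \\
&= (I_{n_0} \otimes Q_1)(I_{n_0} \otimes F_{q_1} \otimes I_{c_1})
\left[ \prod_{j=1}^{s-2} (I_{n_j} \otimes Q_{2j+1})(I_{n_j} \otimes F_{q_{j+1}} \otimes I_{c_{j+1}}) \right]
(I_{n_{s-1}} \otimes F_{q_s})
\left[ \prod_{j=1}^{s-2} (I_{n_{s-j-1}} \otimes Q_{2(s-j)}) \right] (I_{n_0} \otimes Q_2) \\
&= \left[ \prod_{j=0}^{s-2} (I_{n_j} \otimes Q_{2j+1})(I_{n_j} \otimes F_{q_{j+1}} \otimes I_{c_{j+1}}) \right]
(I_{n_{s-1}} \otimes F_{q_s})
\left[ \prod_{j=1}^{s-1} (I_{n_{s-j-1}} \otimes Q_{2(s-j)}) \right]
\end{aligned}
\]

which is the desired result.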


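Q.E.D.

Corollary 4.8. Using the notation of Theorem 4.7 we have that

\[
F_n = \left[ \prod_{j=0}^{s-2} R_j E_j \right]
(I_{n_{s-2}} \otimes P(q_{s-1}, c_{s-1}))(I_{n_{s-1}} \otimes F_{q_s})
\left[ \prod_{j=1}^{s-1} (I_{n_{s-j-1}} \otimes Q_{2(s-j)}) \right]
\]

where

\[ R_j = (I_{n_{j-1}} \otimes P(q_j, c_j))(I_{n_j} \otimes Q_{2j+1})(I_{n_j} \otimes P(c_{j+1}, q_{j+1})) \]

and

\[ E_j = (I_{n_j c_{j+1}} \otimes F_{q_{j+1}}). \]

Proof.

Use the fact that

\[ I_{n_j} \otimes F_{q_{j+1}} \otimes I_{c_{j+1}} = (I_{n_j} \otimes P(c_{j+1}, q_{j+1}))(I_{n_j c_{j+1}} \otimes F_{q_{j+1}})(I_{n_j} \otimes P(q_{j+1}, c_{j+1})) \]
in Theorem 4.7.

Q.E.D.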
One can also use the general radix identity in a similar way to obtain alternative decompositions.

Theorem 4.9. Assume that $n = \prod_{j=1}^{s} k_j$ is any nontrivial factorization of n as a product of positive integers. Let $n_i = \prod_{j=1}^{i} k_j$, $n_0 = 1$, and $c_i = n/n_i$. Then

\[
F_n = \left[ \prod_{j=1}^{s-1} (I_{n_{j-1}} \otimes P(c_j,k_j))(I_{n_{j-1}c_j} \otimes F_{k_j})(I_{n_{j-1}} \otimes P(k_j,c_j))(I_{n_{j-1}} \otimes D(c_j,k_j)) \right]
\left[ I_{n_{s-1}} \otimes F_{k_s} \right]
\left[ \prod_{j=2}^{s} (I_{n_{s-j}} \otimes P(c_{s-j+1}, k_{s-j+1})) \right].
\]


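Proof.

The proof is by induction on s, the number of factors. If s = 2, then $n_1 = k_1$ and $c_1 = k_2$. Therefore, the statement of the theorem is that

\[
\begin{aligned}
F_n &= P(k_2,k_1)(I_{k_2} \otimes F_{k_1})P(k_1,k_2)D(k_2,k_1)(I_{k_1} \otimes F_{k_2})P(k_2,k_1) \\
&= (F_{k_1} \otimes I_{k_2})D(k_2,k_1)(I_{k_1} \otimes F_{k_2})P(k_2,k_1)
\end{aligned}
\]

which is precisely the general radix identity.

Assume that the theorem is true for 2,3,...,s-1 for some s > 2 and that n is as in the statement of the theorem for this s. With this assumption, the theorem is true for $c_1 = \prod_{j=2}^{s} k_j$. Therefore, letting $\hat{n}_j = n_{j+1}/k_1$, $\hat{n}_0 = 1$, and $\hat{c}_j = c_1/\hat{n}_j$ for $j \in [1,s-1]$ and using the facts that $\hat{n}_j k_1 = n_{j+1}$ and $\hat{c}_j = c_{j+1}$ we have that

\[
F_{c_1} = \left[ \prod_{j=1}^{s-2} (I_{\hat{n}_{j-1}} \otimes P(\hat{c}_j,k_{j+1}))(I_{\hat{n}_{j-1}\hat{c}_j} \otimes F_{k_{j+1}})(I_{\hat{n}_{j-1}} \otimes P(k_{j+1},\hat{c}_j))(I_{\hat{n}_{j-1}} \otimes D(\hat{c}_j,k_{j+1})) \right]
\left[ I_{\hat{n}_{s-2}} \otimes F_{k_s} \right]
\left[ \prod_{j=2}^{s-1} (I_{\hat{n}_{s-j-1}} \otimes P(\hat{c}_{s-j}, k_{s-j+1})) \right].
\]

Hence

\[
\begin{aligned}
I_{k_1} \otimes F_{c_1} = I_{n_1} \otimes F_{c_1}
&= \left[ \prod_{j=1}^{s-2} (I_{\hat{n}_{j-1}n_1} \otimes P(c_{j+1},k_{j+1}))(I_{\hat{n}_{j-1}n_1c_{j+1}} \otimes F_{k_{j+1}})(I_{\hat{n}_{j-1}n_1} \otimes P(k_{j+1},c_{j+1}))(I_{\hat{n}_{j-1}n_1} \otimes D(c_{j+1},k_{j+1})) \right] \\
&\qquad \left[ I_{\hat{n}_{s-2}n_1} \otimes F_{k_s} \right]
\left[ \prod_{j=2}^{s-1} (I_{\hat{n}_{s-j-1}n_1} \otimes P(c_{s-j+1}, k_{s-j+1})) \right] \\
&= \left[ \prod_{j=2}^{s-1} (I_{n_{j-1}} \otimes P(c_j,k_j))(I_{n_{j-1}c_j} \otimes F_{k_j})(I_{n_{j-1}} \otimes P(k_j,c_j))(I_{n_{j-1}} \otimes D(c_j,k_j)) \right]
\left[ I_{n_{s-1}} \otimes F_{k_s} \right]
\left[ \prod_{j=2}^{s-1} (I_{n_{s-j}} \otimes P(c_{s-j+1}, k_{s-j+1})) \right].
\end{aligned}
\]

Therefore

\[
\begin{aligned}
F_n &= (F_{k_1} \otimes I_{c_1})D(c_1,k_1)(I_{k_1} \otimes F_{c_1})P(c_1,k_1) \\
&= P(c_1,k_1)(I_{c_1} \otimes F_{k_1})P(k_1,c_1)D(c_1,k_1)(I_{k_1} \otimes F_{c_1})P(c_1,k_1) \\
&= (I_{n_0} \otimes P(c_1,k_1))(I_{n_0c_1} \otimes F_{k_1})(I_{n_0} \otimes P(k_1,c_1))(I_{n_0} \otimes D(c_1,k_1)) \\
&\qquad \left[ \prod_{j=2}^{s-1} (I_{n_{j-1}} \otimes P(c_j,k_j))(I_{n_{j-1}c_j} \otimes F_{k_j})(I_{n_{j-1}} \otimes P(k_j,c_j))(I_{n_{j-1}} \otimes D(c_j,k_j)) \right]
\left[ I_{n_{s-1}} \otimes F_{k_s} \right] \\
&\qquad \left[ \prod_{j=2}^{s-1} (I_{n_{s-j}} \otimes P(c_{s-j+1}, k_{s-j+1})) \right] (I_{n_0} \otimes P(c_1,k_1)) \\
&= \left[ \prod_{j=1}^{s-1} (I_{n_{j-1}} \otimes P(c_j,k_j))(I_{n_{j-1}c_j} \otimes F_{k_j})(I_{n_{j-1}} \otimes P(k_j,c_j))(I_{n_{j-1}} \otimes D(c_j,k_j)) \right]
\left[ I_{n_{s-1}} \otimes F_{k_s} \right]
\left[ \prod_{j=2}^{s} (I_{n_{s-j}} \otimes P(c_{s-j+1}, k_{s-j+1})) \right]
\end{aligned}
\]

which is the desired result.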


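Q.E.D.

Corollary 4.10. Let $n$, $n_i$, and $c_i$ be as in Theorem 4.7 and denote

\[ P_i = P(c_{i+1}, p_{i+1}^{k_{i+1}}) \quad \text{and} \quad D_i = D(c_{i+1}, p_{i+1}^{k_{i+1}}). \]

Then

\[
F_n = \left[ \prod_{j=1}^{s-1} (I_{n_{j-1}} \otimes P_{j-1})(I_{n_{j-1}c_j} \otimes F_{p_j^{k_j}})(I_{n_{j-1}} \otimes P_{j-1}^{-1})(I_{n_{j-1}} \otimes D_{j-1}) \right]
\left[ I_{n_{s-1}} \otimes F_{p_s^{k_s}} \right]
\left[ \prod_{j=2}^{s} (I_{n_{s-j}} \otimes P_{s-j}) \right].
\]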


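Corollary 4.11. If $k \ge 2$, then $F_{p^k} = G_p (I_{p^{k-1}} \otimes F_p) H_p$ where

\[
G_p = \prod_{m=0}^{k-2} \left[ (I_{p^m} \otimes P_{p^{k-m}})(I_{p^{k-1}} \otimes F_p)(I_{p^m} \otimes P_{p^{k-m}}^{-1})(I_{p^m} \otimes D_{p^{k-m}}) \right]
\]

and

\[
H_p = \prod_{m=2}^{k} \left( I_{p^{k-m}} \otimes P_{p^m} \right).
\]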


Note  that  Corollary  4.11  results  in  tridiagonal  decompositions  in  the  case  p  =  2. 
This  is  essentially  the  decomposition  developed  by  Pease  [40]. 
To obtain tridiagonal decompositions of the Fourier matrices of arbitrary size,
methods  based  on  identities  other  than  the  twiddle  free  and  the  general  radix  identities 
must  be  used.  Towards  this  end  we  formulate  a  matrix  identity  associated  with  the 
Rader  prime  algorithm. 

Let $\alpha \in Z_p$ with $\alpha \neq 0, 1$. Let $R_{1,p}(\alpha)$ and $R_{2,p}(\alpha)$ be the permutation matrices corresponding to the permutations $\sigma_1$ and $\sigma_2$ on $Z_p$ defined by:


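\[
\sigma_1 : \; 0 \mapsto 0, \quad \alpha^i \mapsto i, \qquad\qquad
\sigma_2 : \; 0 \mapsto 0, \quad i \mapsto \alpha^{-i}, \qquad i \in [1, p-1].
\]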


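Let $c_j = \omega_p^{\alpha^j} - 1$ and let $C_p$ be the (p-1) x (p-1) circulant matrix $C_p = \mathrm{circ}(c_0, c_1, \ldots, c_{p-2})$. Let A be any n x n matrix. We denote by $A^{(m)}$ the (n+m) x (n+m) matrix

\[
A^{(m)} = \begin{pmatrix} I_m & 0 \\ 0 & A \end{pmatrix}.
\]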

We denote by $U_p$ the p x p matrix with all ones in the first column, ones down the diagonal, and zeroes elsewhere. For example, if p = 5, then

\[
U_5 = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 1
\end{pmatrix}.
\]

Finally, let $\tilde{F}_p$ denote the p x p matrix defined by $\tilde{F}_p = \mathrm{blockdiag}[1, (\omega_p^{ij} - 1)]$ where $i,j \in [1,p-1]$. The matrix formulation of the Rader prime algorithm is then given in parts 1.) and 2.) of the following theorem.

Theorem 4.12. Let $f_p(x)$ be the polynomial defined by

\[ f_p(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_{p-2} x^{p-2} \]
and $\Lambda_p$ the p x p matrix

\[ \Lambda_p \equiv \mathrm{diag}(1, f_p(\omega_{p-1}^{i-1})), \quad i = 1,2,\ldots,p-1. \]

Then

1.) $F_p = U_p \tilde{F}_p U_p^t$.

2.) $\tilde{F}_p = R_{1,p}(\alpha)\, C_p^{(1)}\, R_{2,p}(\alpha)$.

3.) $C_p^{(1)} = F_{p-1}^{(1)} \Lambda_p F_{p-1}^{*(1)}$.

Proof.

The proof of 1.) consists of a transparent matrix multiplication and was observed by Parlett [39]. The identity 3.) follows from the matrix form of the convolution theorem (Section 3.2).

To prove 2.) we write

\[
R_{1,p}(\alpha) = \begin{pmatrix} 1 & \\ & R \end{pmatrix}
\quad \text{and} \quad
R_{2,p}(\alpha) = \begin{pmatrix} 1 & \\ & S \end{pmatrix}
\]

where R and S are the permutation matrices corresponding to the restrictions of $\sigma_1$ and $\sigma_2$ to the set { 1,2,...,p-1 } and the blank spaces represent zero entries. We also denote

\[
\tilde{F}_p = \begin{pmatrix} 1 & \\ & F \end{pmatrix}
\]

where $F = (\omega_p^{ij} - 1)$. Since $R^{-1} = R^t$ and $S^{-1} = S^t$, it is sufficient to show that $R^t F S^t = C_p$. If we denote $A = (a_{ij}) = R^t F S^t$ where $i,j \in [1,p-1]$, then $a_{ij} = \omega_p^{\sigma_1^{-1}(i)\sigma_2(j)} - 1$. Note that if $j \in [1,p-1]$, then $\sigma_1^{-1}(1)\sigma_2(j) = \alpha\alpha^{-j} = \alpha^{p}\alpha^{-j}$, which implies that $a_{1j} = c_{p-j}$. Thus, the first row of A is equal to the first row of $C_p$. The fact that A is circulant follows from the identity $\sigma_1^{-1}(i+k)\sigma_2(j+k) = \alpha^{i+k}\alpha^{-(j+k)} = \alpha^{i}\alpha^{-j} = \sigma_1^{-1}(i)\sigma_2(j)$. Therefore, $A = C_p$ and the theorem is proved.

Q.E.D.

Putting these identities together yields $F_p = U_p R_{1,p}(\alpha) F_{p-1}^{(1)} \Lambda_p F_{p-1}^{*(1)} R_{2,p}(\alpha) U_p^t$ where * denotes the conjugate transpose.
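
Two ingredients of the Rader formulation can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the unnormalized Fourier matrix, it takes g to be a primitive root modulo p, and it indexes the rows of the core by g^i and the columns by g^(-j), which is the standard Rader re-indexing and may differ in detail from the permutations used in the text. The function names are hypothetical.

```python
import cmath

def dft(p):
    return [[cmath.exp(-2j * cmath.pi * r * c / p) for c in range(p)] for r in range(p)]

def matmul(A, B):
    return [[sum(A[r][t] * B[t][c] for t in range(len(B))) for c in range(len(B[0]))]
            for r in range(len(A))]

def part1_error(p):
    """Largest entry of F_p - U_p * blockdiag[1, (w**(i*j) - 1)] * U_p^t."""
    w = cmath.exp(-2j * cmath.pi / p)
    U = [[1 if (c == 0 or c == r) else 0 for c in range(p)] for r in range(p)]
    Ft = [[0] * p for _ in range(p)]
    Ft[0][0] = 1
    for i in range(1, p):
        for j in range(1, p):
            Ft[i][j] = w ** (i * j) - 1
    Ut = [list(col) for col in zip(*U)]
    G, F = matmul(matmul(U, Ft), Ut), dft(p)
    return max(abs(F[r][c] - G[r][c]) for r in range(p) for c in range(p))

def rader_core(p, g):
    """Re-index the (p-1) x (p-1) core by powers of g: rows g**i, columns g**(-j)."""
    w = cmath.exp(-2j * cmath.pi / p)
    size = p - 1
    rows = [pow(g, i, p) for i in range(size)]
    cols = [pow(g, (size - j) % size, p) for j in range(size)]
    return [[w ** (a * b) - 1 for b in cols] for a in rows]

def is_circulant(A, tol=1e-9):
    n = len(A)
    return all(abs(A[i][j] - A[(i + 1) % n][(j + 1) % n]) < tol
               for i in range(n) for j in range(n))
```

The re-indexed core depends only on i - j modulo p - 1, which is why it can be diagonalized by a Fourier matrix of order p - 1.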

Observe that if A and B are any n x n matrices, then $(I_m \otimes A^*) = (I_m \otimes A)^*$ and $A^{(m)}B^{(m)} = (AB)^{(m)}$. Therefore, if $F_{p-1} = \prod T_i$ is a tridiagonal decomposition of $F_{p-1}$, then $I_m \otimes F_{p-1}^{(1)} = I_m \otimes (\prod T_i)^{(1)} = I_m \otimes \prod (T_i^{(1)}) = \prod (I_m \otimes T_i^{(1)})$ is a tridiagonal decomposition of $I_m \otimes F_{p-1}^{(1)}$. Hence, if we can determine tridiagonal decompositions of $U_p$ for p an odd prime, then we can proceed inductively to factor $F_{p-1}^{(1)}$ until p = 2 or p = 3. Thus, we need only determine methods for factoring the matrices $U_p$ for p an odd prime to complete our description of the FFT-based decompositions.

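Theorem 4.13. Let p be an odd prime. Then

\[
U_p = \left[ \prod_{k=p-1}^{2}
\begin{pmatrix}
I_k & 0 & 0 \\
0,\ldots,0,1 & 1 & 0 \\
0 & 0 & I_{p-k-1}
\end{pmatrix}
\right]
\begin{pmatrix}
1 & 0 & & & \\
1 & 1 & 0 & & \\
0 & -1 & 1 & & \\
 & & \ddots & \ddots & \\
 & & & -1 & 1
\end{pmatrix}.
\]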

Proof.

Let

\[
U = \left[ \prod_{k=p-1}^{2}
\begin{pmatrix}
I_k & 0 & 0 \\
0,\ldots,0,1 & 1 & 0 \\
0 & 0 & I_{p-k-1}
\end{pmatrix}
\right]
\begin{pmatrix}
1 & 0 & & & \\
1 & 1 & 0 & & \\
0 & -1 & 1 & & \\
 & & \ddots & \ddots & \\
 & & & -1 & 1
\end{pmatrix}
\]

and denote

\[
V_k = \begin{pmatrix}
I_k & 0 & 0 \\
0,\ldots,0,1 & 1 & 0 \\
0 & 0 & I_{p-k-1}
\end{pmatrix}
\]

and

\[
V = \begin{pmatrix}
1 & 0 & & & \\
1 & 1 & 0 & & \\
0 & -1 & 1 & & \\
 & & \ddots & \ddots & \\
 & & & -1 & 1
\end{pmatrix}.
\]

Let $x = (x_0, x_1, \ldots, x_{p-1})^t \in \mathbb{C}^p$. We show that $U_p x = Ux$. Note that

\[ U_p x = (x_0,\; x_1+x_0,\; x_2+x_0,\; \ldots,\; x_{p-1}+x_0)^t. \]
Now

\[ Vx = (x_0,\; x_0+x_1,\; x_2-x_1,\; x_3-x_2,\; \ldots,\; x_{p-1}-x_{p-2})^t. \]
Furthermore, if $y = (y_0, y_1, \ldots, y_{p-1})^t \in \mathbb{C}^p$, then

\[ V_k y = (y_0, y_1, \ldots, y_{k-1},\; y_k+y_{k-1},\; y_{k+1}, \ldots, y_{p-1})^t. \]

Thus,

\[ V_2(Vx) = (x_0,\; x_0+x_1,\; x_0+x_2,\; x_3-x_2,\; \ldots,\; x_{p-1}-x_{p-2})^t \]

and one can show by induction that

\[ V_j(V_{j-1}V_{j-2} \cdots V_2 V x) = (x_0,\; x_0+x_1,\; \ldots,\; x_0+x_j,\; x_{j+1}-x_j,\; \ldots,\; x_{p-1}-x_{p-2})^t \]

for $j \in [2,p-1]$. Hence, we may conclude that $Ux = U_p x$ for every $x \in \mathbb{C}^p$ which implies that $U = U_p$.

Q.E.D. 
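
The factorization in Theorem 4.13 can be followed step by step on a vector, each step touching only adjacent entries. The following sketch mirrors the proof: one application of V followed by V_2, ..., V_{p-1} in that order. The function names are illustrative.

```python
def apply_V(x):
    """V x = (x0, x0 + x1, x2 - x1, ..., x_{p-1} - x_{p-2})."""
    return [x[0], x[0] + x[1]] + [x[i] - x[i - 1] for i in range(2, len(x))]

def apply_Vk(y, k):
    """V_k y replaces y_k by y_k + y_{k-1} and leaves every other entry alone."""
    z = list(y)
    z[k] = y[k] + y[k - 1]
    return z

def apply_Up_locally(x):
    """Compute U_p x using nearest-neighbor steps only."""
    y = apply_V(x)
    for k in range(2, len(x)):
        y = apply_Vk(y, k)
    return y
```

For x = (3, 1, 4, 1, 5) the intermediate vector after V is (3, 4, 3, -3, 4), and the three remaining steps produce (3, 4, 7, 4, 8), which is x with x_0 added to every other entry. The loop also shows why all but the first step are inherently sequential: step k needs the result of step k - 1.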

A  graphical  representation  of  the  decompositions  is  given  in  Figure  4.1. 


Figure 4.1.  Local Computation of U_5.
We have described how the problem of finding tridiagonal decompositions of
Fourier  matrices  can  be  solved  by  applying  matrix  identities  associated  with  the  FFT 
and  the  Rader  prime  algorithm.  Theorems  4.7-4.13  suggest  the  following  algorithm  for 
computing  tridiagonal  decompositions  of  Fn. 

1.)  Use  either  the  general  radix  or  twiddle  free  identity  to  factor  Fn  into  a  product 
of  permutation  matrices  and  block  diagonal  matrices  with  diagonal  blocks  of  the  form  Fp 
for  p  a  prime  divisor  of  n. 

2.)  Use  Theorem  4.12  to  factor  Fp  for  the  odd  prime  divisors  p  of  n. 

3.)  Use  Theorem  4.13  to  factor  $U_p$  and  $U_p^t$.

4.)  Go  to  1.)  for  each  p-1  with  $p \neq 2,3$.

5.)  Continue  until  all  the  prime  divisors  are  2  or  3. 
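
The recursion in steps 1.) through 5.) can be traced without building any matrices. The sketch below only lists the transform sizes that the procedure visits: it starts with n and, for every prime divisor p > 3 encountered, descends to p - 1. It is a bookkeeping illustration with hypothetical function names, not the program referred to in the Appendix.

```python
def prime_factors(n):
    """Return {prime: exponent} for n >= 2 by trial division."""
    factors, d = {}, 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def rader_schedule(n):
    """Sizes of the Fourier matrices factored by the procedure, in visiting order."""
    order = []

    def visit(size):
        order.append(size)
        for p in sorted(prime_factors(size)):
            if p > 3:          # primes 2 and 3 need no further reduction
                visit(p - 1)   # the Rader identity reduces F_p to F_{p-1}

    visit(n)
    return order
```

For n = 22 the prime 11 leads to size 10, whose prime 5 leads to size 4, so the schedule is [22, 10, 4]; for n = 12 no reduction is needed.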

In  the  Appendix,  we  present  computer  programs  based  on  these  ideas  which  are 
written  in  an  extended  FORTRAN  77  which  provides  for  the  use  of  image  algebra  opera- 
tions. 

We  briefly  examine  the  effects  of  applying  the  identities  directly  to  the  two- 
dimensional  Fourier  matrices.  One  can  use  the  resulting  identities  to  map  two- 
dimensional  DFTs  onto  arrays  which  have  dimensions  smaller  than  the  image  dimensions 
or  to  reduce  the  number  of  parallel  multiplication  steps  that  are  required  by  a  factor  of 
two  when  using  the  general  radix  identity. 

Assume for simplicity that X is an n x n array and that n = mk. Using the general radix identity we can write

\[
\begin{aligned}
F_{n \times n} = F_n \otimes F_n
&= \left[ (F_m \otimes I_k) D(k,m) (I_m \otimes F_k) P(k,m) \right] \otimes \left[ (F_m \otimes I_k) D(k,m) (I_m \otimes F_k) P(k,m) \right] \\
&= P_1 (I_{k^2} \otimes (F_m \otimes F_m)) P_2 (D(k,m) \otimes D(k,m)) P_3 (I_{m^2} \otimes (F_k \otimes F_k)) P_4
\end{aligned}
\]
where $P_1$, $P_2$, $P_3$, and $P_4$ are permutation matrices. This identity expresses $F_{n \times n}$ as a product of diagonal matrices, permutation matrices, and two-dimensional DFTs of smaller order. This can be used to map an n x n DFT onto a smaller array. It would not lead to a good algorithm in the event that the array was big enough for the image due to the increased number of permutations. One can rewrite the equation as follows:

\[
F_{n \times n} = ((F_m \otimes I_k) \otimes I_n)(I_n \otimes (F_m \otimes I_k))(D(k,m) \otimes D(k,m))((I_m \otimes F_k) \otimes I_n)(I_n \otimes (I_m \otimes F_k))(P(k,m) \otimes P(k,m)).
\]

This decomposition still compresses two multiplication steps into one in the matrix $(D(k,m) \otimes D(k,m))$ but uses the separability properties of the Fourier matrices to keep the problem on a one-dimensional level with respect to the other terms in the expression. This equation represents the best of both worlds, in a sense; the number of multiplication steps is decreased and the number of permutation steps is held fixed. Therefore, using this identity along with the decompositions developed in this chapter would seem to be the best path to follow for implementation on a mesh-connected array.

4.2.  Complexity  of  the  FFT-Based  Method 

In  this  section  we  derive  upper  bounds  on  the  number  of  parallel  steps  required  to 
implement  the  algorithms  implied  by  the  decompositions  derived  in  the  previous  section. 
We consider a basic step to be the equivalent of one $\oplus$ operation with a local template. Thus, given a tridiagonal decomposition of a matrix, we consider the parallel complexity to be the number of tridiagonal matrices appearing in the decomposition plus the number of parallel steps required to implement the permutations. Due to the structure of the FFT-based decompositions, the complexity can be divided into multiplication, addition, and permutation terms. We briefly consider the parallel multiplicative and
additive  complexity  of  the  algorithms  implied  by  the  previous  decompositions.  Since 
these  decompositions  are  related  to  FFTs,  the  associated  complexity  counts  are  similar 
to  those  of  FFTs.  The  complexity  of  the  permutations  is  then  considered.  It  will  be  seen 
that  it  is  the  latter  that  dominates  the  overall  complexity  in  most  cases  in  terms  of  total 
number  of  steps.  This  can  be  attributed  to  the  fact  that  FFTs  are  based  on  reindexing 
schemes.  When  implementing  an  algorithm  on  an  architecture  such  as  mesh-connected 
arrays,  reindexing  can  be  an  "expensive"  operation.  A  local  permutation  step  is  much 
less  expensive  than  a  floating  point  multiplication  however.  We  parameterize  the  times 
required  to  implement  a  local  multiplication  step,  a  local  addition  step,  and  a  local  per- 
mutation step  based  on  execution  times  for  the  Massively  Parallel  Processor  to  compare 
the  general  radix  and  twiddle  free  identities  [60]. 

For  any  matrix  A  we  denote  by  Cm(A)  and  Ca(A)  the  number  of  parallel  multiplica- 
tion and  addition  steps  required  to  implement  A  locally  on  a  linear  array.  By  a  parallel 
permutation  step  we  mean  the  action  of  switching  the  data  in  horizontally  or  vertically 
adjacent  cells.  We  denote  by  Cr(A)  the  number  of  parallel  permutation,  or  data  routing, 
steps  required  to  implement  A  locally.  We  acknowledge  that  there  is  some  ambiguity  in 
this  notation  since  there  can  be  more  than  one  decomposition  available  for  a  given  A  and 
the  complexity  measures  with  respect  to  the  different  decompositions  may  be  different. 
It  will  either  be  explicitly  stated  which  decomposition  is  being  used  or  it  will  be  clear 
from  the  context. 

Remark 4.14. Note that if x = m, a, or r, then

\[ C_x(AB) = C_x(A) + C_x(B) \]

and, since we are considering parallel complexity,

\[ C_x(I_j \otimes A) = C_x(A) \]

for any matrices A and B and positive integer j.

We first consider the number of parallel multiplication steps. We suppress the multiplication by $n^{-1/2}$.

Theorem 4.15. Let n be as in Theorem 4.7.

1.) Using Corollary 4.8 followed by Corollary 4.11 to decompose $F_n$ results in

\[ C_m(F_n) = \left[ \sum_{j=1}^{s} \left( k_j C_m(F_{p_j}) + k_j \right) \right] - s. \]

2.) Using Corollary 4.10 followed by Corollary 4.11 to decompose $F_n$ results in

\[ C_m(F_n) = \left[ \sum_{j=1}^{s} \left( k_j C_m(F_{p_j}) + k_j \right) \right] - 1, \]

where we allow s to be equal to 1.

3.) If p is an odd prime, then using Theorem 4.12 results in

\[ C_m(F_p) = 2C_m(F_{p-1}) + 1. \]


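Proof.

1.) By Corollary 4.8,

\[
C_m(F_n) = \left[ \sum_{j=0}^{s-2} C_m(E_j) \right] + C_m(I_{n_{s-1}} \otimes F_{q_s})
= \left[ \sum_{j=0}^{s-2} C_m(F_{q_{j+1}}) \right] + C_m(F_{q_s})
= \sum_{j=1}^{s} C_m(F_{p_j^{k_j}}).
\]

By Corollary 4.11, if p is any prime and k > 1

\[
\begin{aligned}
C_m(F_{p^k}) &= \sum_{m=0}^{k-2} \left[ C_m(I_{p^{k-1}} \otimes F_p) + C_m(I_{p^m} \otimes D_{p^{k-m}}) \right] + C_m(F_p) \\
&= \sum_{m=0}^{k-2} C_m(F_p) + k - 1 + C_m(F_p) = kC_m(F_p) + k - 1.
\end{aligned}
\]

Hence,

\[ C_m(F_n) = \left[ \sum_{j=1}^{s} \left( k_j C_m(F_{p_j}) + k_j \right) \right] - s \]
which proves 1.).

2.) The only difference here (in terms of multiplications) is the presence of the twiddle factors. Hence,

\[
C_m(F_n) = \left[ \sum_{j=1}^{s} C_m(F_{p_j^{k_j}}) \right] + s - 1
= \left[ \sum_{j=1}^{s} \left( k_j C_m(F_{p_j}) + k_j \right) \right] - 1.
\]

3.) This follows immediately from Theorem 4.12.

Q.E.D.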


Hence, the difference in the number of parallel multiplications required by the two different methods is s-1 multiplication steps. Furthermore, as pointed out in the previous section, on a two-dimensional array the difference can be kept fixed at s-1 multiplication steps without increasing the number of steps required to execute the other operations. Note that $C_m(F_2) = 0$ and that if $n = 2^k$, then $C_m(F_n) = \log_2 n - 1$.

Theorem 4.16. Using either method results in the following expressions for the additive complexities:

1.) If $n = \prod_{j=1}^{s} p_j^{k_j}$ is not prime, then

$$C_a(F_n) = \sum_{j=1}^{s} k_j C_a(F_{p_j}).$$

2.) If p is an odd prime, then

$$C_a(F_p) = 2C_a(F_{p-1}) + 2(p-1).$$

Proof. 

The  proof  of  1.)  is  almost  identical  to  the  proofs  of  the  previous  theorem.  The  only 
difference  is  that  the  twiddle  factors  require  no  additions.  The  proof  of  2.)  follows 
immediately  from  Theorems  4.12  and  4.13. 

Q.E.D. 

Note that the additions due to the matrices $U_p$ will have a significant influence on the additive complexity. This is because the method of computing the $U_p$, although local, is, except for one step, essentially serial.
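The recurrences of Theorems 4.15 and 4.16 can be evaluated mechanically. The following is a minimal sketch, not part of the original development. The function names are hypothetical, and two things are assumed: that $C_a(F_2) = 1$ (the text states only $C_m(F_2) = 0$), and that the same decomposition method is reused when the recursion reaches $F_{p-1}$. Under these assumptions, n = 100 gives 9 and 8 multiplication steps and 26 addition steps.

```python
from functools import lru_cache


def factorize(n):
    """Return the prime factorization of n as a dict {prime: exponent}."""
    factors = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


@lru_cache(maxsize=None)
def mult_steps(n, twiddle_free=False):
    """Parallel multiplication steps C_m(F_n) following Theorem 4.15."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if n == 2:
        return 0  # stated in the text: C_m(F_2) = 0
    f = factorize(n)
    if f.get(n) == 1:
        # n is an odd prime: part 3.) of the theorem.
        # Assumption: the same method is used to decompose F_{n-1}.
        return 2 * mult_steps(n - 1, twiddle_free) + 1
    total = sum(k * mult_steps(p, twiddle_free) + k for p, k in f.items())
    s = len(f)
    # Part 1.) subtracts s (twiddle free), part 2.) subtracts 1 (general radix).
    return total - (s if twiddle_free else 1)


@lru_cache(maxsize=None)
def add_steps(n):
    """Parallel addition steps C_a(F_n) following Theorem 4.16."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if n == 2:
        return 1  # assumption: one addition step for F_2
    f = factorize(n)
    if f.get(n) == 1:
        return 2 * add_steps(n - 1) + 2 * (n - 1)
    return sum(k * add_steps(p) for p, k in f.items())


if __name__ == "__main__":
    for n in (100, 120, 128):
        print(n, mult_steps(n), mult_steps(n, True), add_steps(n))
```

The two multiplication counts always differ by s-1, in agreement with the remark following Theorem 4.15.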

We  now  move  to  a  discussion  of  the  permutation  complexity.  The  permutations  can 
be  executed  locally  by  the  odd-even  transposition  sort  (OETS)  which  is  a  parallel  algo- 
rithm  for  sorting  data  arranged  in  a  one-dimensional  array. 

Remark 4.17. It has been shown that the OETS will execute an arbitrary permutation of n objects on a linear array in at most n parallel steps [28, 62]. We will show that, because of the special structure of the shuffle permutations, the OETS will execute the permutation $\sigma_{m,k}$ corresponding to P(m,k) in either $(m-1)(k-1)$ or $(m-1)(k-1)+1$ parallel steps. In order to keep the complexity estimates together, we delay proving this result until after we have derived upper bounds on the parallel permutation complexities.


In what follows, we make the convention that $\sum_{j=m}^{n} x_j = 0$ if $n < m$.

Theorem 4.18. Upper bounds on the number of parallel permutation steps required to implement the various algorithms are:

1.) If n is as in Theorem 4.7 and Corollary 4.8 is used to decompose $F_n$, then

$$C_r(F_n) \le n + \sum_{j=0} c_{j-1} + (p_{s-1}^{k_{s-1}} - 1)(p_s^{k_s} - 1) + 1 + \sum_{j=1}^{s} C_r(F_{p_j^{k_j}}).$$

2.) If n is as in Theorem 4.7 and Corollary 4.10 is used to decompose $F_n$, then

$$C_r(F_n) \le 3n - 2\sum_{j=1}^{s} p_j^{k_j} + 4(s-1) + \sum_{j=1}^{s} C_r(F_{p_j^{k_j}}).$$

3.) If $n = p^k$ for $k \ge 2$ and Corollary 4.11 is used to decompose $F_n$, then

$$C_r(F_n) \le 3n - 2pk + kC_r(F_p).$$

4.) If n = p is an odd prime and Theorem 4.12 is used to decompose $F_n$, then

$$C_r(F_n) \le 2(n-1) + 2C_r(F_{n-1}).$$

Thus, in any case, the upper bound on the number of parallel permutation steps is linear in n, that is, $C_r(F_n)$ is $O(n)$.


Proof.

1.) Denote $R = \sum_{j=1}^{s} C_r(F_{p_j^{k_j}})$. Since $c_{s-1} = p_s^{k_s}$,

$$C_r(F_n) \le \sum_{j=0} C_r(R_j) + C_r\left(P(p_{s-1}^{k_{s-1}}, p_s^{k_s})\right) + R + C_r\left(\prod_{j=1}^{s-1} (I_{n_{s-j-1}} \otimes Q_{2(s-j)})\right).$$

By definition of $R_j$ and Remark 4.14,

$$C_r(R_j) \le C_r\left(P(q_j,c_j)(I_{q_j} \otimes Q_{2j+1})(I_{q_j} \otimes P(c_{j+1},q_{j+1}))\right) \le q_j c_j = c_{j-1}.$$

Since $P(p_{s-1}^{k_{s-1}}, p_s^{k_s})$ is a shuffle permutation, $C_r(P(p_{s-1}^{k_{s-1}}, p_s^{k_s})) \le (p_{s-1}^{k_{s-1}} - 1)(p_s^{k_s} - 1) + 1$.

The permutation $Q = \prod_{j=1}^{s-1} (I_{n_{s-j-1}} \otimes Q_{2(s-j)})$ can be implemented as a sequence of permutations or as a one step permutation. Note that

$$Q_2 = T_{q_1}(C^*_{c_1 q_1}) P(c_1,q_1)$$

so

$$C_r(Q_2) \le c_1 + (c_1 - 1)(q_1 - 1) + 1 = n - q_1 + 2 = c_0 - q_1 + 2.$$

Similarly,

$$Q_{2(s-j)} = T_{q_{s-j}}(C^*_{c_{s-j} q_{s-j}}) P(c_{s-j},q_{s-j})$$

so

$$C_r(Q_{2(s-j)}) \le c_{s-j} + (c_{s-j} - 1)(q_{s-j} - 1) + 1 = c_{s-j-1} - q_{s-j} + 2.$$

Thus, implementing Q as a sequence of permutations leads to

$$C_r(Q) \le \sum_{j=1}^{s-1} c_{j-1} = n + \frac{n}{n_1} + \frac{n}{n_2} + \cdots + \frac{n}{n_{s-2}} > n.$$

Implementing Q as a one step permutation yields $C_r(Q) \le n$.

Combining the observations made in the previous two paragraphs yields

$$C_r(F_n) \le C_r(Q) + \sum_{j=0} c_{j-1} + (p_{s-1}^{k_{s-1}} - 1)(p_s^{k_s} - 1) + 1 + R,$$

which proves 1.).

2.) Let R be as in 1.). By Corollary 4.10,

$$C_r(F_n) \le 2\sum_{j=1}^{s-1} C_r(P_{c_j}) + C_r\left(\prod_{j=2}^{s-1} (I_{n_{j-1}} \otimes P_{c_j})\right) + R.$$

Since the $P_{c_j} = P(c_j, p_j^{k_j})$ are shuffle permutations, $C_r(P_{c_j}) \le (c_j - 1)(p_j^{k_j} - 1) + 1$. Note that $c_j p_j^{k_j} = c_{j-1}$. Hence,

$$\sum_{j=1}^{s-1} C_r(P_{c_j}) \le \sum_{j=1}^{s-1} c_{j-1} - \sum_{j=1}^{s-1} c_j - \sum_{j=1}^{s-1} p_j^{k_j} + 2(s-1) = n - p_s^{k_s} - \sum_{j=1}^{s-1} p_j^{k_j} + 2(s-1) = n - \sum_{j=1}^{s} p_j^{k_j} + 2(s-1).$$

By similar reasoning as used in 1.) for the permutation matrix Q,

$$C_r\left(\prod_{j=2}^{s-1} (I_{n_{j-1}} \otimes P_{c_j})\right) \le n.$$

Thus,

$$C_r(F_n) \le 2n + 4(s-1) - 2\sum_{j=1}^{s} p_j^{k_j} + n + R,$$

which proves 2.).

3.) By Corollary 4.11,

$$C_r(F_n) = C_r(G_p (I_{p^{k-1}} \otimes F_p) H_p) = 2\sum_{m=0}^{k-2} C_r(P(p^{k-m-1},p)) + \sum_{m=0}^{k-1} C_r(F_p) + C_r(H_p) \le 2\sum_{m=0}^{k-2} \left[ (p^{k-m-1} - 1)(p-1) + 1 \right] + kC_r(F_p) + n.$$

A little algebra yields

$$\sum_{m=0}^{k-2} \left[ (p^{k-m-1} - 1)(p-1) + 1 \right] = p^{k-1}(p-1) \sum_{m=0}^{k-2} \left( \frac{1}{p} \right)^m - (k-1)(p-1) + (k-1) = p(p^{k-1} - 1) - p(k-1) = n - pk.$$

Therefore,

$$C_r(F_n) \le 3n - 2pk + kC_r(F_p),$$

which proves 3.).

4.) This inequality follows immediately from Theorem 4.12.

Q.E.D. 

The  upper  bounds  on  the  permutation  complexities  can  be  sharpened  in  certain 
special  cases.  This  is  because  the  structure  of  the  permutations  is  not  destroyed  (or 
clouded  over)  by  the  presence  of  too  many  permutation  matrices. 

Theorem 4.19. If n is as in Theorem 4.7 with s = 2, then

1.) If the twiddle free identity is used to decompose $F_n$, then

$$C_r(F_n) \le 3(n+2) - 2(p_2^{k_2} + p_1^{k_1}) + C_r(F_{p_1^{k_1}}) + C_r(F_{p_2^{k_2}}).$$

2.) If the general radix identity is used to decompose $F_n$, then

$$C_r(F_n) \le 3(n+2) - 3(p_1^{k_1} + p_2^{k_2}) + C_r(F_{p_1^{k_1}}) + C_r(F_{p_2^{k_2}}).$$

3.) If $n = p^2$, then

$$C_r(F_n) \le 3(p-1)^2 + 3 + 2C_r(F_p).$$

Proof.

1.) Let $m = p_1^{k_1}$ and $k = p_2^{k_2}$. The twiddle free identity can be written

$$F_n = P(k,m)\, T_k(C^*_{mk})\, (I_k \otimes F_m)\, P(m,k)\, (I_m \otimes F_k)\, T_m(C^*_{km})\, P(k,m).$$

Since $T_m(C^*_{km})$ is a block diagonal permutation matrix with k x k blocks, $C_r(T_m(C^*_{km})) \le k$. Hence,

$$C_r(T_m(C^*_{km}) P(k,m)) \le (k-1)(m-1) + 1 + k = n - m + 2.$$

Similarly,

$$C_r(P(k,m) T_k(C^*_{mk})) \le (k-1)(m-1) + 1 + m = n - k + 2.$$

Whence,

$$C_r(F_n) \le n - k + 2 + (m-1)(k-1) + 1 + n - m + 2 + C_r(F_m) + C_r(F_k) = 3n + 6 - 2k - 2m + C_r(F_m) + C_r(F_k),$$

which proves 1.).

2.) Let m and k be as in 1.). The general radix identity is

$$F_n = P(k,m)\, (I_k \otimes F_m)\, P(m,k)\, D(k,m)\, (I_m \otimes F_k)\, P(k,m).$$

Thus,

$$C_r(F_n) = 3C_r(P(k,m)) + C_r(F_m) + C_r(F_k) \le 3(m-1)(k-1) + 3 + C_r(F_m) + C_r(F_k) = 3n - 3(m+k) + 6 + C_r(F_m) + C_r(F_k),$$

which proves 2.).

3.) In this case the general radix identity is

$$F_n = P(p,p)\, (I_p \otimes F_p)\, P(p,p)\, D(p,p)\, (I_p \otimes F_p)\, P(p,p),$$

from which the desired inequality follows immediately.

Q.E.D.


Remark 4.20. Note that the complexity bound corresponding to the twiddle free identity in Theorem 4.18 is not symmetric in the prime factors of n. This is due to the fact that a better estimate can be obtained for $P(p_{s-1}^{k_{s-1}}, p_s^{k_s})$ than for $P(q_j,c_j)(I_{q_j} \otimes Q_{2j+1})(I_{q_j} \otimes P(c_{j+1},q_{j+1}))$. Thus, if using this identity it would be wise to order the prime factors so that the quantity $(p_1^{k_1} - 1)(p_2^{k_2} - 1)$ is minimized.

Remark 4.21. The difference in the bounds in 1.) and 2.) of Theorem 4.18 is the quantity

$$D = \sum_{j=0} c_{j-1} - 2n + (p_1^{k_1} - 1)(p_2^{k_2} - 1) + 2\sum_{j=1}^{s} p_j^{k_j} + 1 + 4(s-1).$$

Since $c_{-1} = n$, $p_1^{k_1} p_2^{k_2} \le n$, and $2\sum_{j=1}^{s} p_j^{k_j} - p_1^{k_1} - p_2^{k_2} > 0$, it follows that $D > 0$. Thus, as expected, it takes more permutation steps using the twiddle free identity. In fact, let $P_0$ denote the time it takes to execute one parallel permutation step and $M_0$ the time it takes to execute one multiplication step. Then, disregarding the overhead involved in implementing the various operations, for the general radix identity to be preferable to the twiddle free identity we should have $(s-1)M_0 < DP_0$ or

$$R = \frac{M_0}{P_0} < \frac{D}{s-1}.$$

In the special cases given in Theorem 4.19 the difference in the bounds is $D = p_1^{k_1} + p_2^{k_2}$. On the Massively Parallel Processor the quantity R is on the order of $10^3$ [60]. Since the GAPP has the same clock rate as the MPP it is reasonable to expect that the quantity is about the same for the GAPP [54]. What we are neglecting here is the fact that the physical actions of moving the data or performing the multiplications need to be controlled. The instructions need to be broadcast to the individual processors. Moreover, when implementing the permutations, comparisons need to be made in order to decide whether or not to move the data. These things take time and are valid considerations when attempting to compare and evaluate the algorithms resulting from the decompositions. An evaluation of the effects of such considerations requires a knowledge of the basic instructions of the processors, which we do not have.
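The break-even criterion of Remark 4.21 can be sketched in a few lines. The function names and the timing values below are hypothetical; only the inequality $(s-1)M_0 < DP_0$ and the special-case value $D = p_1^{k_1} + p_2^{k_2}$ for s = 2 are taken from the text, and control overhead is ignored exactly as in the remark.

```python
def bound_difference_two_factors(m, k):
    """D for s = 2 (Theorem 4.19), with m = p1**k1 and k = p2**k2."""
    return m + k


def general_radix_preferable(mult_time, perm_time, d, s):
    """True when (s - 1) * M0 < D * P0, i.e. R = M0 / P0 < D / (s - 1)."""
    return (s - 1) * mult_time < d * perm_time


if __name__ == "__main__":
    ratio = 1000  # assumed value of R = M0 / P0, of the order quoted for the MPP
    # n = 100 = 25 * 4: D = 29, so the extra twiddle multiplication costs more
    print(general_radix_preferable(ratio, 1, bound_difference_two_factors(25, 4), 2))
    # n = 3072 = 1024 * 3: D = 1027, so the saved permutation steps dominate
    print(general_radix_preferable(ratio, 1, bound_difference_two_factors(1024, 3), 2))
```

With a ratio of this size the general radix identity pays off only when the prime power factors themselves are large.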

We  have  computed  some  of  the  quantities  of  interest  related  to  the  complexities  of 
the  various  decompositions  and  used  them  to  construct  Table  4.1. 


Table 4.1. Estimated number of parallel steps required to implement Fn locally.

n               Cr:GR   Cr:TF   Cm:GR   Cm:TF   Ca(Fn)
100 (25·4)        227     316       9       8       26
120 (8·5·3)       367     400       8       6       21
128               356       -       6       -        7
256               736       -       7       -        8
360 (9·8·5)      1096    1161      11       9       27
500 (125·4)      1506    1648      21      20       38
512              1500       -       8       -        9

GR denotes use of the general radix identity. TF denotes use of the twiddle free identity.
Cr denotes permutation steps. Cm denotes multiplication steps. Ca denotes addition steps.


We now show that the shuffle permutations corresponding to P(m,n) are executed by the odd-even transposition sort (OETS) in either $(n-1)(m-1)$ or $(n-1)(m-1)+1$ steps. The OETS is a method of executing a permutation of a linear array of data using a sequence of adjacent transpositions. The idea is to apply the same sequence of transpositions to the data that is required to sort the vector

$$(\sigma(0), \sigma(1), \ldots, \sigma(n-1))$$

into increasing order, where $\sigma$ is the permutation to be implemented.

We  first  formally  define  the  OETS  and  give  an  example  of  it.  We  then  establish 
some  notation  and  prove  a  preparatory  lemma  before  proving  the  aforementioned  result. 

Let $\sigma$ be any permutation on $Z_{mn}$ and define a function $h_\sigma = h : \{0,1,2,\ldots\} \to Z_{mn}^{mn}$ by defining

$$h(0) = (\sigma(0), \sigma(1), \ldots, \sigma(mn-1))$$

and, for each $k \in [1,\infty)$, defining h(k) recursively as follows:

1.) If k is odd, then $h(k) = (h_{k,0}, h_{k,1}, \ldots, h_{k,mn-1})$ where

$$h_{k,j} = \begin{cases} h_{k-1,j-1} & \text{if } j \text{ is odd and } h_{k-1,j} < h_{k-1,j-1} \\ h_{k-1,j+1} & \text{if } j \text{ is even, } j < mn-1, \text{ and } h_{k-1,j} > h_{k-1,j+1} \\ h_{k-1,j} & \text{else.} \end{cases}$$

2.) If k is even, then $h(k) = (h_{k,0}, h_{k,1}, \ldots, h_{k,mn-1})$ where

$$h_{k,j} = \begin{cases} h_{k-1,j-1} & \text{if } j \text{ is even, } j \ne 0, \text{ and } h_{k-1,j} < h_{k-1,j-1} \\ h_{k-1,j+1} & \text{if } j \text{ is odd, } j < mn-1, \text{ and } h_{k-1,j} > h_{k-1,j+1} \\ h_{k-1,j} & \text{else.} \end{cases}$$

As mentioned earlier, it has been shown that if $\sigma$ is any permutation on $Z_{mn}$, then h(mn-1) = (0,1,2,...,mn-1) and that if k > mn-1, then h(k) = h(mn-1). The algorithm used to generate the vectors h(k) from h(0) using steps 1.) and 2.) is called the odd-even transposition sort. The fact that h(mn-1) is arranged in increasing order means that the coordinates of h(0) have been permuted by $\sigma$. The following example shows how the OETS works.

Let $\sigma = \sigma_{3,4}$ be the shuffle permutation corresponding to the mapping $a + 3b \to 4a + b$. Then

h(0) = (0,4,8,1,5,9,2,6,10,3,7,11)

h(1) = (0,4,1,8,5,9,2,6,3,10,7,11)

h(2) = (0,1,4,5,8,2,9,3,6,7,10,11)

h(3) = (0,1,4,5,2,8,3,9,6,7,10,11)

h(4) = (0,1,4,2,5,3,8,6,9,7,10,11)

h(5) = (0,1,2,4,3,5,6,8,7,9,10,11)

h(6) = (0,1,2,3,4,5,6,7,8,9,10,11)

To permute a one-dimensional array of data (a(0),a(1),...,a(11)) locally using the OETS one uses the sequence of transpositions used to generate h(6) from h(0). Note that the algorithm stabilizes in (3-1)(4-1) steps.
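The definition and the example above can be reproduced with a short serial simulation. This is a minimal sketch, not the program referred to later in the text. The function names are hypothetical, each parallel step is simulated by a loop over disjoint adjacent pairs, and the layout of h(0) for general m and n is assumed to follow the block form used in the proof of Theorem 4.24.

```python
def shuffle_vector(m, n):
    """h(0) for the shuffle: n blocks, block i being (i, n+i, ..., (m-1)n+i)."""
    return [j * n + i for i in range(n) for j in range(m)]


def oets_pass(h, k):
    """Apply parallel step k >= 1 of the odd-even transposition sort."""
    out = list(h)
    # Odd k compares positions (0,1), (2,3), ...; even k compares (1,2), (3,4), ...
    start = 0 if k % 2 == 1 else 1
    for j in range(start, len(out) - 1, 2):
        if out[j] > out[j + 1]:
            out[j], out[j + 1] = out[j + 1], out[j]
    return out


def oets_trace(h0):
    """Return [h(0), h(1), ..., h(K)], K being the first step at which h is sorted."""
    target = sorted(h0)
    trace = [list(h0)]
    k = 0
    while trace[-1] != target:
        k += 1
        trace.append(oets_pass(trace[-1], k))
    return trace


def oets_permute(data, h0):
    """Move data[i] to position h0[i] using the transpositions that sort h0."""
    keyed = list(zip(h0, data))  # keys are distinct, so tuples compare by key
    return [value for _, value in oets_trace(keyed)[-1]]


if __name__ == "__main__":
    trace = oets_trace(shuffle_vector(3, 4))
    for k, h in enumerate(trace):
        print(k, h)
    print(len(trace) - 1)  # number of parallel steps used
    print("".join(oets_permute("abcdefghijkl", shuffle_vector(3, 4))))
```

For the example, the trace consists of the seven vectors h(0) through h(6) listed above, so six parallel steps are used.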

We now establish notation. Let n,m > 1 be arbitrary but fixed positive integers and let $h = h_{\sigma_{m,n}}$. Let $p_x(k)$ denote the position of $x \in Z_{mn}$ at step k, that is, if $x = h_{k,j}$ then $p_x(k) = j$. Let $b_x(k)$ denote the number of elements $y \in Z_{mn}$ with the property that $p_y(k) < p_x(k)$ and $y > x$. Similarly, let $\bar{b}_x(k)$ be the number of elements $y \in Z_{mn}$ with the property that $p_y(k) > p_x(k)$ and $y < x$. Note that if $b_x(k) = 0$ for every $x \in Z_{mn}$, then h(k) is sorted in increasing order. Note further that if $b_x(k) = 0$, then $b_x(j) = 0$ for every $j \ge k$. These observations hold true for $\bar{b}_x$ also. The criterion that we shall use to determine if h is sorted is given in the next lemma.
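The two counting functions are easy to compute directly, which may help in following the proofs below. The sketch is illustrative only; the names `left_larger` and `right_smaller` are hypothetical stand-ins for $b_x$ and $\bar{b}_x$, and the vector used is h(0) from the example with 3 and 4.

```python
def left_larger(h):
    """b_x for every entry x of h: the number of larger entries to the left of x."""
    return {x: sum(1 for y in h[:pos] if y > x) for pos, x in enumerate(h)}


def right_smaller(h):
    """bar-b_x for every entry x of h: the number of smaller entries to the right of x."""
    return {x: sum(1 for y in h[pos + 1:] if y < x) for pos, x in enumerate(h)}


if __name__ == "__main__":
    h0 = [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
    b = left_larger(h0)
    b_bar = right_smaller(h0)
    print(b[1], b[3], b_bar[8])
    # A sorted vector has every count equal to zero.
    print(all(v == 0 for v in left_larger(sorted(h0)).values()))
```

Here the entry 3 leads the last block and has six larger entries to its left, which is the value i(m-1) used in the proof of Theorem 4.24 with i = 3 and m = 3.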

Lemma 4.23. If $Z_{mn} = H \cup K$ such that for some positive integer k, $x \in H$ and $y \in K$ implies that $x < y$ and $b_x(k) = \bar{b}_y(k) = 0$, then h(k) is sorted.

Proof.

Assume that the lemma is not true. Then there exists $z \in Z_{mn}$ such that $p_z(k) \ne z$. In fact, we can take z such that $p_z(k) > z$. There must be a $w \in Z_{mn}$ such that $w > z$ and $p_w(k) < p_z(k)$ since there are more than z positions to fill to the left of z. Hence $b_z(k) \ne 0$ which implies that $z \notin H$ so $z \in K$. The same inequalities imply that $\bar{b}_w(k) \ne 0$. Thus, $w \notin K$ so $w \in H$. By hypothesis, this implies that $w < z$ which contradicts $w > z$. Therefore, we conclude that the lemma is true.

Q.E.D.


Before proving Theorem 4.24, we point out that, although the proof is long, it is not hard to get an intuitive feel for why the OETS works so well for the shuffle permutations. By writing out a few of the permutations the reader can see that they are analogous to dealing a hand of cards. The objects are sorted in a uniform, shuffled fashion. Therefore, after the first few steps, many exchanges are made at each step. For those who prefer computational evidence to proofs, a computer program was written which executes P(m,n) using the OETS. This program was run on all pairs $m,n \in [2,100]$ and Theorem 4.24 was verified in every case.

Theorem 4.24. The odd-even transposition sort will execute P(m,n) in either $(n-1)(m-1)$ or $(n-1)(m-1)+1$ parallel steps.

Proof.

The proof is divided into two cases: m even and m odd.

Case 1. m even.

In this case we can write

$$h(0) = (E_0^{(1)}, E_0^{(2)}, E_1^{(1)}, E_1^{(2)}, \ldots, E_{n-1}^{(1)}, E_{n-1}^{(2)})$$

where

$$E_i^{(1)} = (i,\ n+i,\ 2n+i,\ \ldots,\ (\tfrac{m}{2}-1)n+i)$$

$$E_i^{(2)} = (\tfrac{m}{2}n+i,\ (\tfrac{m}{2}+1)n+i,\ \ldots,\ (m-1)n+i).$$

By abuse of notation, we write $x \in E_i^{(j)}$ to denote that x is one of the entries of the vector $E_i^{(j)}$.

Let $i \in [0,n-1]$. We show that if $x \in E_i^{(1)}$, then $b_x(1+i(m-1)) = 0$. Thus, in particular, it will follow that if $x \in E_i^{(1)}$ for any $i \in [0,n-1]$, then $b_x(1+(n-1)(m-1)) = 0$.

Note that if $x \in E_0^{(1)}$, then $b_x(0) = 0$. Let $i \in [1,n-1]$ and $j \in [0, \tfrac{m}{2}-1]$. Let x = jn+i and let $k_x$ be the smallest integer such that $b_x(k_x) = 0$. Note that due to the restrictions on i and j, $x \in E_i^{(1)}$. We show that the following claims are true:

Claim 1.) If $k \in [0,j+1]$, then $p_x(k) = p_x(0)$.

Claim 2.) If $k \in [j+2,k_x]$, then $p_x(k) = p_x(k-1) - 1$.

Note that Claims 1.) and 2.) together imply that $k_x = j + 1 + b_x(0)$.

Since m is even, the comparisons made at the first step of the algorithm are all made within the vectors $E_s^{(t)}$ for $s \in [0,n-1]$ and $t \in \{1,2\}$. These vectors are already sorted in increasing order. Thus, there are no exchanges made on the first pass of the algorithm. Claim 1.) then follows from the fact that if $y \in E_i^{(1)}$, $z \in E_{i-1}^{(2)}$, and $w \in E_i^{(2)}$, then $z > y$ and $w > y$.

We prove Claim 2.) by double induction on i and j. Let i = 1 and j = 0. The claim is then that $p_1(k) = p_1(k-1) - 1$ for $k \in [2,k_1]$. By Claim 1.), $p_1(1) = p_1(0)$. Since $p_0(0) = 0$, the inequality $0 < p_z(k) < p_1(k)$ implies that $z > 1$. Thus, $p_1(k) = p_1(k-1) - 1$. Assume that the claim is true for every x = kn+1 with $k \in [0,j-1]$ and $j \in [0, \tfrac{m}{2}-1]$. By Claim 1.),

$$p_{(j-1)n+1}(j+2) < p_{(j-1)n+1}(j+1) < p_{(j-1)n+1}(j) = p_{(j-1)n+1}(j-1) = \cdots = p_{(j-1)n+1}(0)$$

and

$$p_{jn+1}(j+2) < p_{jn+1}(j+1) = p_{jn+1}(j) = p_{jn+1}(j-1) = \cdots = p_{jn+1}(0).$$

Note that if $z \in E_1^{(1)}$ and $y \in E_0^{(2)}$, then $z < y$. Furthermore, if $z \in E_0^{(1)}$ and $z > x$, then $z > (j-1)n+1 = x-n$ which implies that $k_{(j-1)n+1} \ge k_x$. Putting these last three observations together yields the equality $p_{jn+1}(k) - p_{(j-1)n+1}(k) = 2$ for $k \in [j+2,k_x] \subset [j+2,k_{(j-1)n+1}]$. By the induction hypothesis on j, $p_{(j-1)n+1}(k) = p_{(j-1)n+1}(k-1) - 1$ for $k \in [j+2,k_x] \subset [j+1,k_{(j-1)n+1}]$. Thus, $p_{jn+1}(k) = p_{jn+1}(k-1) - 1$ for $k \in [j+2,k_x]$.

Assume that for some $i \in [2,n-1]$, Claim 2.) is true for every $j_0 n + i_0$ with $i_0 \in [1,i-1]$ and $j_0 \in [0, \tfrac{m}{2}-1]$. Let x = i and note that $\{ j : j < i \text{ and } p_j(0) < p_i(0) \} = \{0,1,\ldots,i-1\}$. Thus $b_i(0) = i(m-1)$. By the induction hypothesis on i, $p_{i-1}(k) = p_{i-1}(k-1) - 1$ for $k \in [2,k_{i-1}]$. Moreover, $k_{i-1} = 0 + 1 + (i-1)(m-1)$ which yields $k_i \ge b_i(0) = i(m-1) \ge i(m-1) + 2 - m = k_{i-1}$. Assume that $k \in [2,k_i]$, $p_z(k) = p_i(k) - 1$, and $z < i$. Then z = i-1 so $b_{i-1}(k) = 0$. Therefore, $b_i(k) = 0$ since if $p_w(k) < p_i(k)$, and $w \ne i-1$, then $p_w(k) < p_{i-1}(k)$. Hence, $k = k_i$. This implies that for every $k \in [2,k_i]$, $p_i(k) = p_i(k-1) - 1$.

Assume that for some $j \in [1, \tfrac{m}{2}-1]$, Claim 2.) is true for every hn+i with $h \in [1,j-1]$ and that it is not true for x = jn+i. Let $k_0$ be the smallest of all the integers k in the interval $[j+2,k_x]$ with the property that $p_x(k) = p_x(k-1)$. Note that if $k_0 = k_x$, then $b_x(k_x-1) = 0$ which contradicts the choice of $k_x$. Furthermore, $k_0 = j+2$ would imply that $p_x(j+1) = p_x(j+2)$ which is false by Claim 1.).

If $p_z(k_0-1) = p_x(k_0-1) - 1$, then, by our choice of $k_0$, $z < x$. For such a z, it must be true that $p_z(k_0) = p_z(k_0-1)$ so $p_z(k_0) = p_x(k_0) - 1$. By Claim 1.), $p_x(k) = p_x(0)$ for $k \in [0,j+1]$. By the induction hypothesis, if y = sn+i with $s < j$, then $p_y(j+1) < p_y(0)$. Furthermore, if $z \in E_{i-1}^{(2)}$, then $x < z$ and if $z = sn+i-1 \in E_{i-1}^{(1)}$ and $s > j$, then $z > x$. Thus, if $p_z(j+1) = p_x(j+1) - 1$, then $z > x$ so $p_x(j+2) = p_x(j+1) - 1$. Hence, there exists a $k_1$ with $k_1 < k_0 < k_x$ and $p_z(k_1) = p_z(k_1-1)$, that is, x has to catch up to z. By the induction hypothesis on i, this implies that $b_z(k_1) = 0$. Since $k_1 < k_0 < k_x$, $b_x(k_0) \ne 0$. Since $p_z(k_0) = p_x(k_0) - 1$ and $x > z$, it must be true that $b_z(k_0) \ne 0$ which implies that $b_z(k_1) \ne 0$ which is a contradiction. This proves Claim 2.).

We now need only show that given x = jn+i, $k_x \le 1+i(m-1)$. Recall that $k_x = j + 1 + b_x(0)$. By definition, $x < z$ for every $z \in E_k^{(2)}$, $k \in [0,n-1]$, and $|E_k^{(2)}| = \tfrac{m}{2}$. Since x is the jth element of $E_i^{(1)}$, x is less than $\tfrac{m}{2} - (j+1)$ of the elements of $E_k^{(1)}$ for $k < i$. Hence $b_x(0) = i\tfrac{m}{2} + i(\tfrac{m}{2} - (j+1)) = i(m-(j+1))$. Thus, $k_x = im - (j+1)(i-1)$ so the inequality will be true if and only if $(1-i)(j+1) \le 1-i$ for $i \in [1,n-1]$ and $j \in [0, \tfrac{m}{2}-1]$. If i = 1, then the inequality is clearly satisfied. If i > 1, then the inequality is equivalent to $j+1 \ge 1$ which is also clearly true. Thus, we may conclude that if $x \in E_k^{(1)}$ for some k, then $b_x(1+(n-1)(m-1)) = 0$.

By replacing <'s by >'s and $b$'s by $\bar{b}$'s the same argument can be used to show that if $x \in E_k^{(2)}$ for some k, then $\bar{b}_x(1+(n-1)(m-1)) = 0$. Thus, by Lemma 4.23, the theorem is proved in the case m even.
Case 2. m odd.

In this case we write

$$h(0) = (E_0^{(1)}, m_0, E_0^{(2)}, E_1^{(1)}, m_1, E_1^{(2)}, \ldots, E_{n-1}^{(1)}, m_{n-1}, E_{n-1}^{(2)})$$

where

$$E_i^{(1)} = (i,\ n+i,\ 2n+i,\ \ldots,\ (\tfrac{m-3}{2})n+i)$$

$$m_i = (\tfrac{m-1}{2})n+i$$

$$E_i^{(2)} = ((\tfrac{m+1}{2})n+i,\ (\tfrac{m+3}{2})n+i,\ \ldots,\ (m-1)n+i).$$

Since $m_i > x$ for every $x \in E_i^{(1)}$ and $m_i < x$ for every $x \in E_i^{(2)}$, the argument that was used in Case 1 can be used with very little modification to show that if $x \in E_i^{(1)}$ for any i, then $b_x(1+(n-1)(m-1)) = 0$ and if $x \in E_i^{(2)}$ for any i, then $\bar{b}_x(1+(n-1)(m-1)) = 0$. One modification to be made is in Claim 1.). Since m is odd, some exchanges take place during the first pass of the algorithm. This results in $k_x \le j + 1 + b_x(0)$ rather than $k_x = j + 1 + b_x(0)$. One can then proceed with the same proof keeping in mind that for every $i \in [0,n-1]$, $x \in E_i^{(1)}$ and $y \in E_i^{(2)}$ imply that $x < m_i < y$ which allows for the steady movement of the data. After accounting for these facts, the inequality derived after the proof of Claim 2.) must be rederived, i.e., one needs to show that if x = jn+i, then $k_x \le 1+i(m-1)$. This can be accomplished by substituting m-1 for m. Let $x = (\tfrac{m-3}{2})n + n - 1$. Then, since $x = m_0 - 1 < m_1 < \cdots < m_{n-1}$, if $b_x(k) = 0$, then $b_{m_i}(k) = 0$ for $i \in [0,n-1]$. Since $x \in E_{n-1}^{(1)}$, $b_x(1+(n-1)(m-1)) = 0$ so $b_{m_i}(1+(n-1)(m-1)) = 0$ for $i \in [0,n-1]$. Note that if $y \in E_i^{(2)}$ for any i, then $m_{n-1} < y$. Hence, Lemma 4.23 applies and we may conclude that the OETS takes at most $(n-1)(m-1)+1$ steps in the case that m is odd.

We conclude the proof by noting that, since $|p_{n-1}(0) - (n-1)| = (n-1)(m-1)$ and since there may be no change in the first step of the algorithm, the OETS takes at least $(n-1)(m-1)$ steps.

Q.E.D. 

We  have  shown  how  matrix  identities  associated  with  the  FFT  can  be  used  to 
develop  tridiagonal  decompositions  of  Fourier  matrices  and  we  have  evaluated  the  result- 
ing algorithms.  We  now  derive  a  completely  different  method  for  computing  tridiagonal 
decompositions  of  Fourier  matrices. 


CHAPTER  5 

TRIDIAGONAL  DECOMPOSITIONS  OF  FOURIER  MATRICES 

BY  OBLIQUE  ELIMINATION 


In  this  chapter,  we  show  how  a  technique  called  oblique  elimination  can  be  used  to 
develop  alternative  tridiagonal  decompositions  of  the  Fourier  matrices.  This  method  has 
advantages  over  the  FFT  based  method.  The  major  advantage  is  that  there  is  relatively 
little  data  manipulation  required  to  implement  the  algorithms  implied  by  these  decompo- 
sitions. Another  advantage  is  that  the  parallel  arithmetic  operation  counts  are  all  the 
same  linear  function  of  n  for  every  n  regardless  of  the  compositeness  of  n.  This  is  due  to 
the  fact  that  oblique  elimination  is  a  method  based  on  numerical  linear  algebra  rather 
than  traditional  discrete  Fourier  transform  methods.  A  disadvantage  is  that  the  number 
of  multiplication  steps  is  higher  than  for  the  FFT-based  method,  particularly  for  values 
of  n  such  as  powers  of  two. 

The  parallel  algorithms  resulting  from  these  decompositions  can  be  used  alone  or  in 
conjunction  with  the  FFT-based  method.  Specifically,  the  FFT-based  method  could  be 
used  to  break  the  computation  into  prime  components.  The  method  described  in  this 
chapter  could  then  be  used  in  place  of  the  Rader  prime  algorithm  and  the  convolution 
theorem,  since  the  use  of  those  techniques  requires  a  number  of  arithmetic  and  data 
manipulation  steps. 

Chapter 5 is divided into four sections. In the first section, we describe the theoretical basis of oblique elimination. We then use the theory to develop an algorithm, which we call minimal variable oblique elimination, for implementing oblique elimination in certain cases. In the second section, we develop necessary and sufficient conditions for the minimal variable oblique elimination algorithm to be successful in computing tridiagonal decompositions of a given matrix. In the third section, we show that the minimal variable oblique elimination algorithm can be used to compute tridiagonal decompositions of the Fourier matrices. Finally, in the fourth section, we derive expressions for the parallel and serial operation counts of the algorithm implied by the decomposition. We also derive expressions for the number of operations required to compute the tridiagonal decompositions of a given matrix using the minimal variable oblique elimination algorithm.

5.1.  Background  and  Description  of  Oblique  Elimination 

Recall that, as a special case of the results of Chapter 2, any square matrix over the real or complex numbers can be factored into a product of tridiagonal matrices. The proof of the theorem does not yield a good algorithm for doing so, however, and, as mentioned in the Introduction, the problem of developing algorithms that will factor matrices in such a way is still unsolved in general. Tchuente has proposed a method called oblique elimination which will work in certain cases [61]. In this section we describe the theoretical basis developed in [61] and then show how an algorithm can be derived from the theory.

Throughout  this  chapter  let  n  be  an  arbitrary  positive  integer  with  n  >  3  and 
assume  that  all  matrices  and  vectors  are  taken  over  the  complex  numbers.  We  could 
relax  the  last  assumption  but  since  we  are  mainly  interested  in  the  Fourier  matrices  we 
do  not  do  so.  In  this  chapter  we  consider  matrices  and  vectors  to  be  ordered  from  1  to  n 
rather than 0 to n-1 as in the last chapter. When referring to a matrix it will always be assumed that the entries of the matrix are denoted by the same letter as the matrix unless explicitly stated otherwise.

Definition 5.1. Let A be an n x n lower triangular matrix. We say that A is unit lower triangular if $a_{ii} = 1$ for $i \in [1,n]$.

Definition  5.2.  Let  A  be  any  n  x  n  matrix.  We  say  that  A  has  an  LU  decomposition  if 
there  exists  an  n  x  n  unit  lower  triangular  matrix  L  and  an  upper  triangular  matrix  U 
such  that  A  =  LU. 

Computing the LU decomposition of a matrix will be the first step in the oblique elimination algorithm. Not every matrix has an LU decomposition. It can be shown that if A is any n x n matrix then there exists an n x n permutation matrix P such that PA has an LU decomposition [15, 61]. The most common method used for computing LU decompositions is Gaussian elimination with partial pivoting. A discussion of these topics can be found in almost any numerical analysis book [15, 59]. We remark that even if a matrix A has an LU decomposition it may be desirable, for reasons of numerical stability, to compute the LU decomposition of PA for some permutation matrix P rather than that of A.
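A small example may make the role of the permutation matrix concrete. The sketch below is illustrative only: it uses plain Gaussian elimination without pivoting over exact rationals, the function names are hypothetical, and the 3 x 3 matrix is made up for the purpose. For a nonsingular matrix, meeting a zero pivot means that no LU decomposition exists; for singular matrices the routine may give up even though a decomposition exists.

```python
from fractions import Fraction


def lu_no_pivoting(a):
    """Return (L, U) with L unit lower triangular, or None on a zero pivot."""
    n = len(a)
    u = [[Fraction(v) for v in row] for row in a]
    l = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        if u[k][k] == 0:
            return None  # elimination cannot proceed without a row exchange
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]
    return l, u


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    m = [[0, 1, 2], [1, 1, 1], [2, 1, 1]]   # nonsingular, but the (1,1) entry is 0
    pm = [[1, 1, 1], [0, 1, 2], [2, 1, 1]]  # rows 1 and 2 of m exchanged
    print(lu_no_pivoting(m))                # no LU decomposition is found
    l, u = lu_no_pivoting(pm)
    print(matmul(l, u) == pm)
```

In practice one would use partial pivoting throughout, as noted above, rather than exchanging rows only when a zero pivot appears.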

Definition 5.3. Let A be an n x n lower triangular matrix. Let $i \in [1,n]$. The ith oblique of A is the set $\{ a_{n-i+1,1}, a_{n-i+2,2}, \ldots, a_{n,i} \}$.

Note that we have defined the obliques relative to the lower triangular matrices. We could just as well have done so for upper triangular matrices. Throughout this chapter we will work mainly with lower triangular matrices. The techniques can all be applied to the transposes of the upper triangular matrices that appear in the decompositions.

Definition 5.4. Let A be an n x n matrix. We say that A has lower bandwidth r if $a_{ij} = 0$ whenever $i \ge j + r$.

An n x n lower triangular matrix A with lower bandwidth n - j for some $j \in [1,n-1]$ has the property that the kth oblique is { 0 } for every $k \in [1,j]$.
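The two definitions translate directly into index arithmetic. The following sketch is illustrative only: the function names are hypothetical, matrices are lists of rows indexed from 0 (so the entry $a_{n-i+t,t}$ of the text is `a[n-i+t-1][t-1]`), and `lower_bandwidth` returns the smallest r for which Definition 5.4 holds.

```python
def oblique(a, i):
    """The ith oblique (i counted from 1) of an n x n lower triangular matrix."""
    n = len(a)
    # Entries a_{n-i+1,1}, a_{n-i+2,2}, ..., a_{n,i} in the 1-based notation of the text.
    return [a[n - i + t][t] for t in range(i)]


def lower_bandwidth(a):
    """Smallest r such that a[i][j] == 0 whenever i >= j + r."""
    n = len(a)
    r = 0
    for i in range(n):
        for j in range(n):
            if a[i][j] != 0:
                r = max(r, i - j + 1)
    return r


if __name__ == "__main__":
    # A 4 x 4 unit lower bidiagonal matrix: lower bandwidth 2 = n - 2.
    a = [[1, 0, 0, 0],
         [2, 1, 0, 0],
         [0, 3, 1, 0],
         [0, 0, 4, 1]]
    print(lower_bandwidth(a))
    for i in range(1, 5):
        print(i, oblique(a, i))
```

In the example the first and second obliques consist of zeros, the third is the subdiagonal, and the fourth is the main diagonal.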

The oblique elimination method as applied to an n x n matrix M can be summarized in the following two steps:

1.) Determine a permutation matrix P such that PM = LU is an LU decomposition of PM.

2.) Construct matrices $L_1^{-1}, L_2^{-1}, \ldots, L_{n-2}^{-1}$ such that for every $j \in [1,n-2]$ the matrix $L_j$ is unit lower bidiagonal and the matrix $L_j^{-1} L_{j-1}^{-1} \cdots L_1^{-1} L$ has lower bandwidth n - j. Similarly, construct matrices $U_1^{-1}, U_2^{-1}, \ldots, U_{n-2}^{-1}$ such that for every $j \in [1,n-2]$ the matrix $U_j^t$ is unit lower bidiagonal and the matrix $(U_j^t)^{-1} (U_{j-1}^t)^{-1} \cdots (U_1^t)^{-1} U^t$ has lower bandwidth n - j. Then

$$A = L_{n-2}^{-1} L_{n-3}^{-1} \cdots L_1^{-1} L$$

$$B = U U_1^{-1} U_2^{-1} \cdots U_{n-2}^{-1}$$

are lower and upper bidiagonal matrices respectively. Thus,

$$M = P^{-1} L_1 L_2 \cdots L_{n-2} A B U_{n-2} U_{n-3} \cdots U_1$$

is a tridiagonal decomposition of M. This methodology does not always work because one cannot always construct the matrices $U_j$ and $L_j$. A method for doing so is the following:

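Let $x_1, x_2, \ldots, x_{n-1}$ denote indeterminates and let X denote the n x n matrix

\[
X = \begin{pmatrix}
1 & 0 & \cdots & \cdots & 0 \\
-x_1 & 1 & 0 & & \vdots \\
0 & -x_2 & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & -x_{n-1} & 1
\end{pmatrix}.
\]

Then, since $I_n - X$ is nilpotent,

\[
X^{-1} = (I_n - (I_n - X))^{-1} =
\begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 \\
x_1 & 1 & 0 & 0 & \cdots & 0 \\
x_1 x_2 & x_2 & 1 & 0 & \cdots & 0 \\
x_1 x_2 x_3 & x_2 x_3 & x_3 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
x_1 x_2 \cdots x_{n-1} & x_2 \cdots x_{n-1} & x_3 \cdots x_{n-1} & x_4 \cdots x_{n-1} & \cdots & 1
\end{pmatrix}.
\]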

Thus, one can attempt to construct the matrices $L_1^{-1}, L_2^{-1}, \ldots, L_{n-2}^{-1}$ and
$(U_1^t)^{-1}, (U_2^t)^{-1}, \ldots, (U_{n-2}^t)^{-1}$ in the form of $X^{-1}$. Let A be an n x n lower triangular
matrix with lower bandwidth n-i+1. Then there exists an n x n matrix $X^{-1}$ as above
such that the matrix $B = X^{-1}A$ has lower bandwidth n-i if and only if the nonlinear
system of equations

\[
\begin{aligned}
x_1 x_2 \cdots x_{n-i}\, a_{1,1} + x_2 x_3 \cdots x_{n-i}\, a_{2,1} + \cdots + x_{n-i}\, a_{n-i,1} + a_{n-i+1,1} &= 0 \\
x_2 x_3 \cdots x_{n-i+1}\, a_{2,2} + x_3 x_4 \cdots x_{n-i+1}\, a_{3,2} + \cdots + x_{n-i+1}\, a_{n-i+1,2} + a_{n-i+2,2} &= 0 \\
&\;\,\vdots \\
x_i x_{i+1} \cdots x_{n-1}\, a_{i,i} + x_{i+1} x_{i+2} \cdots x_{n-1}\, a_{i+1,i} + \cdots + x_{n-1}\, a_{n-1,i} + a_{n,i} &= 0
\end{aligned}
\]

has a solution in the indeterminates $x_1, x_2, \ldots, x_{n-1}$. This can be seen by writing the
product $X^{-1}A$ out elementwise and setting the appropriate terms equal to zero.
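
The closed form of the inverse can be checked mechanically. The following is a minimal sketch in exact rational arithmetic; the helper names `build_X`, `build_X_inverse` and `matmul` are hypothetical and the values chosen for the indeterminates are arbitrary.

```python
from fractions import Fraction


def build_X(xs):
    """Unit lower bidiagonal X with -x_k in position (k+1, k) (1-based), xs = [x_1, ..., x_{n-1}]."""
    n = len(xs) + 1
    X = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    for k, x in enumerate(xs):
        X[k + 1][k] = -Fraction(x)
    return X


def build_X_inverse(xs):
    """Closed form of X^{-1}: the (r, c) entry with r > c is x_c x_{c+1} ... x_{r-1} (1-based)."""
    n = len(xs) + 1
    Xinv = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    for c in range(n):
        prod = Fraction(1)
        for r in range(c + 1, n):
            prod *= Fraction(xs[r - 1])  # xs[r-1] is x_r in the 1-based numbering of the text
            Xinv[r][c] = prod
    return Xinv


def matmul(A, B):
    """Plain dense matrix product of two square matrices of the same size."""
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)] for r in range(n)]


xs = [2, 3, 5]
identity = [[Fraction(int(r == c)) for c in range(4)] for r in range(4)]
print(matmul(build_X(xs), build_X_inverse(xs)) == identity)
print(build_X_inverse(xs)[3])  # last row: x1*x2*x3, x2*x3, x3, 1
```

Each entry of the product of the inverse with A is then a sum of such products times entries of A, which is where the nonlinear system comes from.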

In this dissertation we employ an algorithm which attempts to solve the system
using a minimal number of variables. We call the algorithm minimal variable oblique
elimination. If we take $x_1 = x_2 = \cdots = x_{n-i-1} = 0$, then the system reduces to

\[
\begin{aligned}
x_{n-i}\, a_{n-i,1} + a_{n-i+1,1} &= 0 \\
x_{n-i} x_{n-i+1}\, a_{n-i,2} + x_{n-i+1}\, a_{n-i+1,2} + a_{n-i+2,2} &= 0 \\
&\;\,\vdots \\
x_{n-i} x_{n-i+1} \cdots x_{n-1}\, a_{n-i,i} + x_{n-i+1} x_{n-i+2} \cdots x_{n-1}\, a_{n-i+1,i} + \cdots + x_{n-1}\, a_{n-1,i} + a_{n,i} &= 0.
\end{aligned}
\]

A solution to this system, if it exists, is given by

\[
\begin{aligned}
x_{n-i} &= -\frac{a_{n-i+1,1}}{a_{n-i,1}} \\
x_{n-i+1} &= -\frac{a_{n-i+2,2}}{x_{n-i}\, a_{n-i,2} + a_{n-i+1,2}} \\
&\;\,\vdots \\
x_{n-1} &= -\frac{a_{n,i}}{x_{n-i} x_{n-i+1} \cdots x_{n-2}\, a_{n-i,i} + \cdots + x_{n-2}\, a_{n-2,i} + a_{n-1,i}}.
\end{aligned}
\]

We shall refer to this particular solution set as S. Note that the existence of the solution
set S is not equivalent to the existence of a solution to the original system of equations,
even with $x_1 = x_2 = \cdots = x_{n-i-1} = 0$, since if A is the zero matrix, then the system
is solved trivially but the solution set S is undefined. We shall concern ourselves with
determining conditions under which this solution exists.
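
As an illustration, the solution set S can be computed sequentially, each denominator being evaluated from the previously computed unknowns. The sketch below is not the implementation used in this work; the names `minimal_variable_solution` and `apply_X_inverse` are hypothetical, exact rational arithmetic is assumed, and the example matrix is arbitrary.

```python
from fractions import Fraction


def minimal_variable_solution(A, i):
    """Return [x_{n-i}, ..., x_{n-1}] for a lower triangular A with lower bandwidth n-i+1,
    or None when one of the denominators vanishes (the solution set S is then undefined)."""
    n = len(A)

    def a(r, c):
        return Fraction(A[r - 1][c - 1])  # 1-based access, as in the text

    xs = []
    for k in range(i):
        # Denominator of x_{n-i+k}, evaluated in nested form:
        # (...(a_{n-i,k+1} x_{n-i} + a_{n-i+1,k+1}) x_{n-i+1} + ...) x_{n-i+k-1} + a_{n-i+k,k+1}
        d = Fraction(0)
        for j in range(k):
            d = (d + a(n - i + j, k + 1)) * xs[j]
        d += a(n - i + k, k + 1)
        if d == 0:
            return None
        xs.append(-a(n - i + k + 1, k + 1) / d)
    return xs


def apply_X_inverse(xs_full, A):
    """Compute B = X^{-1} A by forward substitution, B_r = A_r + x_{r-1} B_{r-1} (1-based rows)."""
    B = [[Fraction(v) for v in A[0]]]
    for r in range(1, len(A)):
        B.append([Fraction(v) + Fraction(xs_full[r - 1]) * w for v, w in zip(A[r], B[r - 1])])
    return B


# n = 4, i = 2: A has lower bandwidth 3, and X^{-1} A should have lower bandwidth 2.
A = [
    [1, 0, 0, 0],
    [2, 1, 0, 0],
    [4, 3, 1, 0],
    [0, 5, 6, 1],
]
xs = minimal_variable_solution(A, 2)      # x_2, x_3
B = apply_X_inverse([0] + xs, A)          # x_1 = 0
print(xs)
print(B)
zero = [[0] * 4 for _ in range(4)]
print(minimal_variable_solution(zero, 2))  # None: the zero matrix has no solution set S
```

The last call reproduces the remark above: for the zero matrix the system is solved by any choice of unknowns, yet the formulas defining S break down at the first division.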

Definition 5.5. Let A be an n x n lower triangular matrix with lower bandwidth n-i+1.
If the above solution exists for A, then we call it the minimal variable solution. If X is
constructed from this solution in the fashion described in this section, then we say that
$B = X^{-1}A$ is computed from A using minimal variable oblique elimination. X is called
the minimal variable solution matrix for A.

Definition 5.6. Let L be an n x n lower triangular matrix and denote $L_0 = L$. We say
that minimal variable oblique elimination is successful for L if there exist matrices
$L_1, L_2, \ldots, L_{n-2}$ such that $L_i$ is the minimal variable solution matrix for
$L_{i-1}^{-1} L_{i-2}^{-1} \cdots L_1^{-1} L_0$ for every $i \in [1,n-2]$.

Definition 5.7. Let A be any n x n matrix. We say that minimal variable oblique elimination
is successful for A if the following two conditions hold:

1.) A has an LU decomposition A = LU.

2.) Minimal variable oblique elimination is successful for L and $U^t$.

In the next section we derive necessary and sufficient conditions for minimal variable
oblique elimination to be successful.
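
Before turning to those conditions, the definitions can be exercised on a small lower triangular matrix. The sketch below repeatedly computes the minimal variable solution and applies the inverse of the corresponding solution matrix until a lower bidiagonal matrix remains. It is an illustration under stated assumptions: the function names are hypothetical, exact rational arithmetic is used, and the lower triangular Pascal matrix is chosen only because the method happens to succeed for it.

```python
from fractions import Fraction


def minimal_variable_solution(A, i):
    """[x_{n-i}, ..., x_{n-1}] for A with lower bandwidth n-i+1, or None if a denominator is 0."""
    n = len(A)
    xs = []
    for k in range(i):
        d = Fraction(0)
        for j in range(k):
            d = (d + A[n - i + j - 1][k]) * xs[j]
        d += A[n - i + k - 1][k]
        if d == 0:
            return None
        xs.append(-A[n - i + k][k] / d)
    return xs


def oblique_elimination(L):
    """Reduce a lower triangular L to lower bidiagonal form.

    Returns (solutions, A), where solutions[i-1] holds the minimal variable solution of
    step i and A is the final bidiagonal matrix, or None if some step has no solution.
    """
    n = len(L)
    A = [[Fraction(v) for v in row] for row in L]
    solutions = []
    for i in range(1, n - 1):                      # i = 1, ..., n-2
        xs = minimal_variable_solution(A, i)
        if xs is None:
            return None
        # Apply the inverse solution matrix: only rows n-i+1, ..., n (1-based) change,
        # each new row being the old row plus x times the (already updated) row above.
        for step, x in enumerate(xs):
            r = n - i + step                       # 0-based index of the row being updated
            A[r] = [v + x * w for v, w in zip(A[r], A[r - 1])]
        solutions.append(xs)
    return solutions, A


pascal = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 2, 1, 0], [1, 3, 3, 1]]
solutions, bidiagonal = oblique_elimination(pascal)
print(solutions)
print(bidiagonal)

identity = [[int(r == c) for c in range(4)] for r in range(4)]
print(oblique_elimination(identity))  # None: the very first denominator is zero
```

The identity matrix already shows that the method can fail for matrices that are otherwise as well behaved as possible, which motivates the conditions derived next.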

5.2.  Necessary  and  Sufficient  Conditions  for  the  Minimal  Variable  Solution  to  Exist 

In this section we derive necessary and sufficient conditions for the minimal variable
solution to exist for a lower triangular banded matrix A. We will show that the
solution exists if and only if certain submatrices of A are invertible. We then use these
conditions to develop necessary and sufficient conditions for minimal variable oblique
elimination to be successful for any square matrix.

We introduce some notation which will be used throughout this section. If B is a
square matrix, then det(B) denotes the determinant of B. Fix $i \in [0,n]$ and let A be a
lower triangular matrix with lower bandwidth n-i+1, that is,

\[
A = \begin{pmatrix}
a_{1,1} & 0 & \cdots & \cdots & 0 \\
\vdots & a_{2,2} & \ddots & & \vdots \\
a_{n-i+1,1} & \vdots & \ddots & \ddots & \vdots \\
0 & a_{n-i+2,2} & & \ddots & 0 \\
\vdots & \ddots & \ddots & & \vdots \\
0 & \cdots & 0 & a_{n,i} \; \cdots & a_{n,n}
\end{pmatrix}.
\]

For each $k \in [1,i]$, let $M_k$ denote the k x k submatrix of A defined by

\[
M_k = \begin{pmatrix}
a_{n-i,1} & a_{n-i,2} & \cdots & a_{n-i,k-1} & a_{n-i,k} \\
a_{n-i+1,1} & a_{n-i+1,2} & \cdots & a_{n-i+1,k-1} & a_{n-i+1,k} \\
0 & a_{n-i+2,2} & \cdots & a_{n-i+2,k-1} & a_{n-i+2,k} \\
\vdots & \ddots & \ddots & \vdots & \vdots \\
0 & \cdots & 0 & a_{n-i+k-1,k-1} & a_{n-i+k-1,k}
\end{pmatrix}.
\]

For convenience we take $M_0 = (1)$. Note that the $M_k$ are in upper Hessenberg form.
We will show that the minimal variable solution exists for A if and only if the $M_k$ are
all invertible.

For every $k \in [0,i-1]$ we formally define the symbol $d_{n-i+k}$ by

\[
d_{n-i+k} = \Bigl(\prod_{j=0}^{k-1} x_{n-i+j}\Bigr) a_{n-i,k+1} + \Bigl(\prod_{j=1}^{k-1} x_{n-i+j}\Bigr) a_{n-i+1,k+1} + \cdots + x_{n-i+k-1}\, a_{n-i+k-1,k+1} + a_{n-i+k,k+1}.
\]

The above definition is formal in the sense that if upon substituting values for the
symbols $x_{n-i+j}$ in a sequential fashion it happens that $d_{n-i+j} = 0$ for some j, then $x_{n-i+k}$ is
undefined for $k \ge j$. To show that the minimal variable solution exists it must be
shown that the $d_{n-i+k}$ are all defined and nonzero.

In  what  follows,  some  of  the  proofs  are  by  induction.  For  purposes  of  illustrating 
the  reasons  for  the  truth  of  the  theorems,  the  first  several  cases  are  sometimes  shown  to 
be  true  even  though  this  is  not  a  logical  necessity. 

Lemma 5.8. Assume that there exists an $l \in [1,i]$ such that $\det(M_k) \ne 0$ for every
$k \in [0,l-1]$. Then for every such k, $d_{n-i+k}$ exists and

\[
d_{n-i+k} = \frac{\det(M_{k+1})}{\det(M_k)}.
\]

Proof.

The proof is by induction on k. If k = 0 then

\[
d_{n-i} = a_{n-i,1} = \frac{\det(M_1)}{\det(M_0)}.
\]

If l = 1 then we are done. Otherwise $d_{n-i} \ne 0$ by assumption. If k = 1 then

\[
d_{n-i+1} = -\frac{a_{n-i+1,1}}{a_{n-i,1}}\, a_{n-i,2} + a_{n-i+1,2} = \frac{\det(M_2)}{\det(M_1)}.
\]

Assume that the lemma is true for $d_{n-i}, d_{n-i+1}, \ldots, d_{n-i+k-1}$ for some $k \in [1,l-1]$.
Then each $x_{n-i+j}$ for $j \in [0,k-1]$ is defined so $d_{n-i+k}$ is defined and is given by

\[
d_{n-i+k} = \Bigl(\prod_{j=0}^{k-1} x_{n-i+j}\Bigr) a_{n-i,k+1} + \Bigl(\prod_{j=1}^{k-1} x_{n-i+j}\Bigr) a_{n-i+1,k+1} + \cdots + x_{n-i+k-1}\, a_{n-i+k-1,k+1} + a_{n-i+k,k+1}.
\]

By the induction hypothesis,

\[
\begin{aligned}
d_{n-i+k} ={}& (-1)^k \left[\frac{a_{n-i+1,1}}{\det(M_1)/\det(M_0)}\right] \left[\frac{a_{n-i+2,2}}{\det(M_2)/\det(M_1)}\right] \cdots \left[\frac{a_{n-i+k,k}}{\det(M_k)/\det(M_{k-1})}\right] a_{n-i,k+1} \\
&+ (-1)^{k-1} \left[\frac{a_{n-i+2,2}}{\det(M_2)/\det(M_1)}\right] \left[\frac{a_{n-i+3,3}}{\det(M_3)/\det(M_2)}\right] \cdots \left[\frac{a_{n-i+k,k}}{\det(M_k)/\det(M_{k-1})}\right] a_{n-i+1,k+1} \\
&+ \cdots + (-1) \left[\frac{a_{n-i+k,k}}{\det(M_k)/\det(M_{k-1})}\right] a_{n-i+k-1,k+1} + a_{n-i+k,k+1} \\
={}& \frac{(-1)^k \bigl(\prod_{j=1}^{k} a_{n-i+j,j}\bigr) a_{n-i,k+1} + (-1)^{k-1} \det(M_1) \bigl(\prod_{j=2}^{k} a_{n-i+j,j}\bigr) a_{n-i+1,k+1} + \cdots + (-1)\det(M_{k-1})\, a_{n-i+k,k}\, a_{n-i+k-1,k+1} + \det(M_k)\, a_{n-i+k,k+1}}{\det(M_k)}.
\end{aligned}
\]

Since $M_{k+1}$ is in upper Hessenberg form, the numerator of the last expression is
$\det(M_{k+1})$ expanded along the column k+1. This proves the lemma.

Q.E.D.

Lemma 5.9. Assume that there exists an $l \in [0,i-1]$ such that the $d_{n-i+k}$ are defined and
nonzero for every $k \in [0,l]$. Then $\det(M_k) \ne 0$ for every $k \in [0,l]$.

Proof.

The proof is by induction on k. If k = 0, then $\det(M_0) = 1$. If l = 0, then we are
done. Otherwise if k = 1, then $\det(M_1) = a_{n-i,1} = d_{n-i} \ne 0$. If k = 2 then

\[
d_{n-i+1} = -\frac{a_{n-i+1,1}}{a_{n-i,1}}\, a_{n-i,2} + a_{n-i+1,2} = \frac{\det(M_2)}{a_{n-i,1}} \ne 0.
\]

Assume that the lemma is true for $M_0, M_1, \ldots, M_{k-1}$ for some $k \in [2,l]$. Since
$\det(M_0), \det(M_1), \ldots, \det(M_{k-1}) \ne 0$, by the calculations of the previous lemma,

\[
d_{n-i+k-1} = \frac{\det(M_k)}{\det(M_{k-1})}.
\]

By hypothesis, $d_{n-i+k-1} \ne 0$ so we may conclude that $\det(M_k) \ne 0$ which proves the
lemma.

Q.E.D.

Lemmas  5.8  and  5.9  can  be  combined  to  yield  necessary  and  sufficient  conditions 
for  the  minimal  variable  solution  to  exist. 

Theorem 5.10. The minimal variable solution exists for A if and only if the matrices
$M_1, M_2, \ldots, M_i$ are invertible. In this case we have

\[
x_{n-i+k} = -a_{n-i+k+1,k+1}\, \frac{\det(M_k)}{\det(M_{k+1})}
\]

for every $k \in [0,i-1]$.

Proof.

Assume that the minimal variable solution exists for A. Then $d_{n-i+k} \ne 0$ for every
$k \in [0,i-1]$. By Lemma 5.9, $\det(M_k) \ne 0$ for every $k \in [0,i-1]$. By Lemma 5.8,

\[
d_{n-i+k} = \frac{\det(M_{k+1})}{\det(M_k)},
\]

which yields the desired expression for $x_{n-i+k}$. Taking k = i-1 and using $d_{n-1} \ne 0$
shows that $\det(M_i) \ne 0$ as well.

Conversely, assume that the matrices $M_1, M_2, \ldots, M_i$ are invertible. Then, by
Lemma 5.8, for every $k \in [0,i-1]$, $d_{n-i+k}$ exists and is nonzero. Hence, the minimal
variable solution exists.

Q.E.D.
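
The determinant formula of Theorem 5.10 can be checked on a small example. The following sketch is only illustrative: the helper names `det`, `M_sub` and `solution_from_determinants` are hypothetical, the determinant is computed by cofactor expansion (adequate only for small matrices), and the example matrix is arbitrary.

```python
from fractions import Fraction


def det(M):
    """Determinant by cofactor expansion along the first row (fine for small matrices)."""
    if not M:
        return Fraction(1)  # empty matrix, matching the convention det(M_0) = 1
    return sum(
        (-1) ** c * Fraction(M[0][c]) * det([row[:c] + row[c + 1:] for row in M[1:]])
        for c in range(len(M))
    )


def M_sub(A, i, k):
    """k x k submatrix of A on rows n-i, ..., n-i+k-1 and columns 1, ..., k (1-based)."""
    n = len(A)
    return [A[n - i - 1 + m][:k] for m in range(k)]


def solution_from_determinants(A, i):
    """x_{n-i+k} = -a_{n-i+k+1,k+1} det(M_k) / det(M_{k+1}) for k = 0, ..., i-1,
    or None if one of M_1, ..., M_i is singular."""
    n = len(A)
    dets = [det(M_sub(A, i, k)) for k in range(i + 1)]
    if any(d == 0 for d in dets):
        return None
    return [-Fraction(A[n - i + k][k]) * dets[k] / dets[k + 1] for k in range(i)]


# n = 4, i = 2: lower bandwidth 3.
A = [
    [1, 0, 0, 0],
    [2, 1, 0, 0],
    [4, 3, 1, 0],
    [0, 5, 6, 1],
]
print([det(M_sub(A, 2, k)) for k in range(3)])
print(solution_from_determinants(A, 2))
```

For this matrix the sequential substitution gives the unknowns -4/2 = -2 and -5/(1·(-2) + 3) = -5, which agrees with the values obtained from the determinants.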

We now assume that $L = L_0$ is lower triangular and that A is of the form
$A = L_{i-1}^{-1} L_{i-2}^{-1} \cdots L_1^{-1} L_0$ where $L_1, L_2, \ldots,$ and $L_{i-1}$ have been constructed using
minimal variable oblique elimination. In the next theorem, we establish criteria which
will allow us to deduce the existence (or non-existence) of the minimal variable solution
for A using information contained in the matrix L. We will show that, due to the sparse
structure of the $L_j$, one of the matrices $M_k$ will be singular if and only if certain
submatrices of L are singular.

Definition 5.11. If L is an n x n lower triangular matrix, then for every pair of integers
(j,k) with $j \in [1,n-1]$ and $k \in [1,n-j]$ define

\[
L_{kj} = \begin{pmatrix}
l_{j,1} & \cdots & l_{j,k} \\
\vdots & & \vdots \\
l_{j+k-1,1} & \cdots & l_{j+k-1,k}
\end{pmatrix}.
\]

Similarly, if U is an n x n upper triangular matrix, then define

\[
U_{kj} \equiv (U^t)_{kj}.
\]

Theorem 5.12. Assume that $L = L_0$ is an n x n lower triangular matrix and that either
i = 1 or $i \in [2,n-2]$ and n x n matrices $L_1, L_2, \ldots,$ and $L_{i-1}$ have been constructed using
minimal variable oblique elimination. Suppose that $A = L_{i-1}^{-1} L_{i-2}^{-1} \cdots L_1^{-1} L_0$ and that the
minimal variable solution does not exist for A. Let j be the smallest positive integer such
that $M_j$ is singular. Then $L_{j,n-i}$ is singular.


Proof.

We show that $\det(L_{j,n-i}) = 0$. By construction, for $k \in [1,i-1]$,

\[
L_k = \begin{pmatrix}
I_{n-k} & 0 & 0 & \cdots & 0 \\
0 \cdots 0 \;\, {-x^{(k)}_{n-k}} & 1 & 0 & \cdots & 0 \\
0 & -x^{(k)}_{n-k+1} & 1 & \ddots & \vdots \\
\vdots & & \ddots & \ddots & 0 \\
0 & \cdots & 0 & -x^{(k)}_{n-1} & 1
\end{pmatrix}
\]

where $\{x^{(k)}_{n-k+j}\}_{j=0}^{k-1}$ denotes the minimal variable solution for the matrix
$L_{k-1}^{-1} L_{k-2}^{-1} \cdots L_1^{-1} L$. We use the notation $R_k \to R_k + aR_m$ to express the fact that row k
of a matrix is replaced by itself plus a multiple, a, of row m. If B is any n x n matrix
and $k \in [1,i-1]$ then multiplying B on the left by $L_k$ has the effect:

\[
\begin{aligned}
R_m &\to R_m && \text{for } m \in [1,n-k], \\
R_m &\to R_m - x^{(k)}_{m-1} R_{m-1} && \text{for } m \in [n-k+1,n].
\end{aligned}
\]

Thus, multiplying A on the left by $L_{i-1}$ leaves rows 1 to n-i+1 fixed and replaces
rows n-i+2 to n of A with a linear combination of them and the row directly above
them, multiplying $L_{i-1}A$ on the left by $L_{i-2}$ leaves rows 1 to n-i+2 fixed and replaces rows
n-i+3 to n of $L_{i-1}A$ with a linear combination of them and the row directly above
them, ..., multiplying $L_{i-j+1} \cdots L_{i-1}A$ on the left by $L_{i-j}$ leaves rows 1 to n-i+j fixed and
replaces rows n-i+j+1 to n of $L_{i-j+1} \cdots L_{i-1}A$ with a linear combination of them and the
row directly above them. Since the other left multiplications by $L_1, L_2, \ldots, L_{i-j}$ leave
rows n-i, n-i+1, ..., n-i+j-1 of $L_{i-j+1} \cdots L_{i-1}A$ unchanged and

\[
L = \Bigl(\prod_{m=1}^{i-1} L_m\Bigr) A,
\]

it follows that rows n-i, n-i+1, ..., n-i+j-1 of L are linear combinations of rows
n-i, n-i+1, ..., n-i+j-1 of A. In fact, we can write $L_{j,n-i} = \hat{L}_1 \hat{L}_2 \cdots \hat{L}_{i-1} M_j$ where

\[
\hat{L}_k = \begin{pmatrix}
I_{i-k} & 0 & 0 & \cdots & 0 \\
0 \cdots 0 \;\, {-x^{(k)}_{n-k}} & 1 & 0 & \cdots & 0 \\
0 & -x^{(k)}_{n-k+1} & 1 & \ddots & \vdots \\
\vdots & & \ddots & \ddots & 0 \\
0 & \cdots & 0 & -x^{(k)}_{n-i+j-1} & 1
\end{pmatrix}.
\]

That is, $\hat{L}_k$ is the submatrix of $L_k$ occupying the same position as $M_j$ does in A. Since
each $\hat{L}_k$ is unit lower triangular, $\det(\hat{L}_k) = 1$. Hence, $\det(L_{j,n-i}) = \det(M_j) = 0$ which
proves the theorem.

Q.E.D.

Note that if M = LU is an LU decomposition of an n x n matrix M, then the
theorem holds for both L and $U^t$.

We are now in a position to state necessary and sufficient conditions for minimal
variable oblique elimination to be successful for an arbitrary n x n matrix. It is immediate
from the definition that the matrix must have an LU decomposition. Therefore, we
assume this condition in the next theorem.

Theorem 5.13. Assume that B is an n x n matrix with LU decomposition B = LU.
Minimal variable oblique elimination is successful for B if and only if the submatrices $L_{kj}$
and $U_{kj}$ of L and U are invertible for every $j \in [2,n-1]$ and $k \in [1,n-j]$.

Proof.

Assume that the minimal variable oblique elimination method is successful for B
and assume by way of contradiction that there exist $j \in [1,n-1]$ and $k \in [1,n-j]$ such that
$L_{kj}$ is not invertible. Denote $L = L_0$. By Definitions 5.6 and 5.7, there exist matrices
$L_1, L_2, \ldots, L_{n-2}$ such that for $i \in [1,n-2]$, $L_i$ is the minimal variable solution matrix for
$L_{i-1}^{-1} L_{i-2}^{-1} \cdots L_1^{-1} L$. Let $A = L_{n-j-1}^{-1} L_{n-j-2}^{-1} \cdots L_1^{-1} L$. As was shown in the proof of
Theorem 5.12, $\det(L_{kj}) = \det(M_k)$. By Theorem 5.10, $M_k$ is invertible so $\det(M_k) \ne 0$.
Since $L_{kj}$ is assumed to be singular, $\det(L_{kj}) = 0$ which is a contradiction. The same
argument can be applied to $U_{kj}$.

Conversely, assume that for every $j \in [1,n-1]$ and $k \in [1,n-j]$ the matrices $L_{kj}$ and
$U_{kj}$ are invertible and that minimal variable oblique elimination is not successful for B.
Then it is not successful for either L or $U^t$. Assume that it is not successful for L. Let i
be such that we are able to construct the matrices $L_1, L_2, \ldots, L_{i-1}$ using minimal variable
oblique elimination and such that the minimal variable solution does not exist for
$A = L_{i-1}^{-1} L_{i-2}^{-1} \cdots L_1^{-1} L$. Let j be the smallest integer such that $M_j$ is not invertible. By
Theorem 5.12, $L_{j,n-i}$ is not invertible. By definition of $M_j$, $1 \le j \le i \le n-1$ so $n-j \ge 1$.
Therefore, $L_{j,n-i}$ is invertible which is a contradiction. If minimal variable oblique
elimination is not successful for $U^t$, then the same argument can be applied.

Q.E.D.

We have developed criteria which can be applied to the factors of the LU decomposition
of a matrix to determine whether or not oblique elimination will be successful for
that matrix. These criteria are much easier to use than checking whether the original
system of nonlinear equations has a solution at each step. The conditions stated in Theorem
5.13 are stringent. One can easily think of a great many examples of matrices which do
not satisfy the conditions. Fortunately, all of the Fourier matrices do satisfy the
conditions. In the next section we prove this assertion.

5.3.  Application  of  Minimal  Variable  Oblique  Elimination  to  the  Fourier  Matrices 

In this section we show that minimal variable oblique elimination is successful for
any Fourier matrix. In fact, we show that the technique is successful for $PF_n$ where P is
any permutation matrix. We first show that $PF_n$ has an LU decomposition for any
permutation matrix P. We then show that these LU decompositions all satisfy the
conditions of Theorem 5.13 of the previous section. Recall that we denote $\omega = \omega_n =
\exp[-2\pi i/n]$ where $i = \sqrt{-1}$.

Definition 5.14. Let B be any n x n matrix and let $k \in [1,n]$. The leading k x k principal
submatrix of B is the matrix

\[
B_k = \begin{pmatrix}
b_{1,1} & b_{1,2} & \cdots & b_{1,k} \\
b_{2,1} & b_{2,2} & \cdots & b_{2,k} \\
\vdots & \vdots & & \vdots \\
b_{k,1} & b_{k,2} & \cdots & b_{k,k}
\end{pmatrix}.
\]

Lemma 5.15. Let $k \in [1,n]$ and let $s_1, s_2, \ldots, s_k \in [0,n-1]$ be distinct. Let $j \in [0,n-1]$
such that j+k-1 < n. For each $i \in [1,k]$, define

\[
\vec{v}_i = \bigl(\omega^{(j+i-1)s_1},\ \omega^{(j+i-1)s_2},\ \ldots,\ \omega^{(j+i-1)s_k}\bigr)^t.
\]

The set $\{\vec{v}_i\}_{i=1}^{k}$ is a linearly independent set.

Proof.

Assume that the lemma is false, that is, that there exist constants $c_1, c_2, \ldots, c_k$
not all zero such that $\sum_{i=1}^{k} c_i \vec{v}_i = 0$. Writing the sum out yields

\[
\begin{aligned}
\omega^{js_1} \bigl(c_1 + c_2\,\omega^{s_1} + c_3\,\omega^{2s_1} + \cdots + c_k\,\omega^{(k-1)s_1}\bigr) &= 0 \\
\omega^{js_2} \bigl(c_1 + c_2\,\omega^{s_2} + c_3\,\omega^{2s_2} + \cdots + c_k\,\omega^{(k-1)s_2}\bigr) &= 0 \\
&\;\,\vdots \\
\omega^{js_k} \bigl(c_1 + c_2\,\omega^{s_k} + c_3\,\omega^{2s_k} + \cdots + c_k\,\omega^{(k-1)s_k}\bigr) &= 0.
\end{aligned}
\]

Define a polynomial $p(x) \in C[x]$ by $p(x) = c_1 + c_2 x + \cdots + c_k x^{k-1}$. Then, since
$s_i \ne s_j$ if $i \ne j$ and $s_i \in [0,n-1]$ for every $i \in [1,k]$, the set $\{\omega^{s_1}, \omega^{s_2}, \ldots, \omega^{s_k}\}$ is a set of
k distinct roots of p(x). By definition of p(x), $\deg(p(x)) \le k-1$ which implies that p(x) is
the zero polynomial. Therefore, $c_1 = c_2 = \cdots = c_k = 0$ which is a contradiction.

Q.E.D.


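Theorem 5.16. Let P be an n x n permutation matrix. Every leading k x k principal
submatrix of $PF_n$ is invertible.

Proof.

Assume that P represents the permutation $\sigma$, that is, $P = P_\sigma$. Since in this
chapter we consider matrices and vectors to be ordered from 1 to n rather than 0 to n-1,
we consider $\sigma$ as acting on the set { 1, 2, ..., n }. Denote

\[
F_n = \begin{pmatrix} \vec{f}_1 \\ \vec{f}_2 \\ \vdots \\ \vec{f}_n \end{pmatrix}
\]

where $\vec{f}_i \equiv (1,\ \omega^{(i-1)},\ \omega^{2(i-1)},\ \ldots,\ \omega^{(n-1)(i-1)})$. Then

\[
PF_n = \begin{pmatrix} \vec{f}_{\sigma(1)} \\ \vec{f}_{\sigma(2)} \\ \vdots \\ \vec{f}_{\sigma(n)} \end{pmatrix}.
\]

Let $k \in [1,n]$ and let $B_k$ denote the leading k x k principal submatrix of $PF_n$. Thus,
letting $s_i = \sigma(i) - 1$,

\[
B_k = \begin{pmatrix}
1 & \omega^{s_1} & \omega^{2s_1} & \cdots & \omega^{(k-1)s_1} \\
1 & \omega^{s_2} & \omega^{2s_2} & \cdots & \omega^{(k-1)s_2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \omega^{s_k} & \omega^{2s_k} & \cdots & \omega^{(k-1)s_k}
\end{pmatrix}.
\]

Since $\sigma$ is a permutation, the $s_i$ are distinct. By Lemma 5.15, the columns of $B_k$ are
linearly independent.

Q.E.D.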
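
Theorem 5.16 can be sanity-checked numerically for small n by enumerating every row permutation. The sketch below is a floating point experiment, not a proof; the function names are hypothetical and the determinant routine is ordinary Gaussian elimination with partial pivoting.

```python
import cmath
from itertools import permutations


def fourier_matrix(n):
    """F_n with (r, c) entry omega^{(r-1)(c-1)}, omega = exp(-2*pi*i/n), for 1-based r and c."""
    omega = cmath.exp(-2j * cmath.pi / n)
    return [[omega ** (r * c) for c in range(n)] for r in range(n)]


def det(M):
    """Determinant by Gaussian elimination with partial pivoting (complex floating point)."""
    M = [row[:] for row in M]
    n, result = len(M), 1.0 + 0.0j
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) == 0:
            return 0j
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            result = -result
        result *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n):
                M[r][c] -= factor * M[col][c]
    return result


def smallest_leading_minor(n):
    """Smallest |det| over all leading principal submatrices of P F_n, over all row permutations P."""
    F = fourier_matrix(n)
    smallest = float("inf")
    for perm in permutations(range(n)):
        PF = [F[p] for p in perm]
        for k in range(1, n + 1):
            smallest = min(smallest, abs(det([row[:k] for row in PF[:k]])))
    return smallest


print(smallest_leading_minor(4) > 1e-9)
print(smallest_leading_minor(5) > 1e-9)
```

Each leading submatrix is a Vandermonde matrix in distinct powers of the root of unity, so its determinant is a product of differences of distinct roots of unity; the experiment simply confirms that none of these products is numerically zero.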

Corollary 5.17. Let P be an n x n permutation matrix. The matrix $PF_n$ has an LU
decomposition. In particular, $F_n$ has an LU decomposition.

Proof. 

Theorem  5.16  shows  that  PFn  satisfies  the  required  conditions  [15]. 

Q.E.D. 

Theorem 5.18. Let P be an n x n permutation matrix and assume that $PF_n = LU$ is the
LU decomposition of $PF_n$. Let $j \in [1,n-1]$, $k \in [1,n-j]$, and $L_{jk}$ and $U_{jk}$ be as in Definition
5.11. The matrices $L_{jk}$ and $U_{jk}$ are invertible.

Proof.

We first show that $L_{jk}$ is invertible. Denote

\[
\begin{aligned}
\vec{u}_{1k} &\equiv (u_{11}, 0, \ldots, 0)^t \\
\vec{u}_{2k} &\equiv (u_{12}, u_{22}, 0, \ldots, 0)^t \\
\vec{u}_{3k} &\equiv (u_{13}, u_{23}, u_{33}, 0, \ldots, 0)^t \\
&\;\,\vdots \\
\vec{u}_{kk} &\equiv (u_{1k}, u_{2k}, u_{3k}, \ldots, u_{kk})^t.
\end{aligned}
\]

Since $F_n$ is nonsingular and L is unit lower triangular, $\det(F_n) =
\det(P^{-1})\det(L)\det(U) = \det(P^{-1})\det(U) \ne 0$. Hence, $\det(U) = \prod_{i=1}^{n} u_{ii} \ne 0$. This implies
that the scalars $u_{11}, u_{22}, \ldots, u_{kk}$ are nonzero and therefore that $\{\vec{u}_{ik}\}_{i=1}^{k}$ is a linearly
independent set. Since a square matrix is invertible if and only if it takes a basis to a
basis, it is sufficient to show that $\{L_{jk}\vec{u}_{ik}\}_{i=1}^{k}$ is a linearly independent set.

For $i \in [1,k]$, let $\vec{v}_i = (v_{1i}, v_{2i}, \ldots, v_{ki})^t = L_{jk}\vec{u}_{ik}$. Since $L_{jk}$ is k x k and the first
k columns of U have zeroes in rows k+1, k+2, ..., n, $v_{mi}$ is the (j+m-1,i) element of $PF_n$.
Hence, letting $s_m = \sigma(j+m-1) - 1$ so that $v_{mi} = \omega^{(i-1)s_m}$, we have

\[
\begin{aligned}
\vec{v}_1 &= (1, 1, \ldots, 1)^t \\
\vec{v}_2 &= (\omega^{s_1}, \omega^{s_2}, \ldots, \omega^{s_k})^t \\
\vec{v}_3 &= (\omega^{2s_1}, \omega^{2s_2}, \ldots, \omega^{2s_k})^t \\
&\;\,\vdots \\
\vec{v}_k &= (\omega^{(k-1)s_1}, \omega^{(k-1)s_2}, \ldots, \omega^{(k-1)s_k})^t.
\end{aligned}
\]

By Lemma 5.15 with j = 0, $\{\vec{v}_i\}_{i=1}^{k}$ is a linearly independent set. Hence, $L_{jk}$ is
invertible.

We now show that $U_{jk}$ is invertible. The argument is almost the same but some
modifications are necessary. Denote

\[
\begin{aligned}
\vec{l}_{1k} &\equiv (l_{11}, 0, \ldots, 0)^t \\
\vec{l}_{2k} &\equiv (l_{12}, l_{22}, 0, \ldots, 0)^t \\
\vec{l}_{3k} &\equiv (l_{13}, l_{23}, l_{33}, 0, \ldots, 0)^t \\
&\;\,\vdots \\
\vec{l}_{kk} &\equiv (l_{1k}, l_{2k}, l_{3k}, \ldots, l_{kk})^t.
\end{aligned}
\]

Since L is unit lower triangular, $\{\vec{l}_{ik}\}_{i=1}^{k}$ is a linearly independent set. We show that
$\{U_{jk}\vec{l}_{ik}\}_{i=1}^{k}$ is a linearly independent set. For $i \in [1,k]$, let $\vec{v}_i = (v_{1i}, v_{2i}, \ldots, v_{ki})^t =
U_{jk}\vec{l}_{ik}$. Then $v_{mi}$ is the (j+m-1,i) element of $(PF_n)^t$ and is therefore the (i,j+m-1) element
of $PF_n$. Hence, letting $s_i = \sigma(i) - 1$, we have

\[
\begin{aligned}
\vec{v}_1 &= (\omega^{js_1}, \omega^{(j+1)s_1}, \ldots, \omega^{(j+k-1)s_1})^t \\
\vec{v}_2 &= (\omega^{js_2}, \omega^{(j+1)s_2}, \ldots, \omega^{(j+k-1)s_2})^t \\
&\;\,\vdots \\
\vec{v}_k &= (\omega^{js_k}, \omega^{(j+1)s_k}, \ldots, \omega^{(j+k-1)s_k})^t.
\end{aligned}
\]

Let M be the k x k matrix having $\vec{v}_i^{\,t}$ in the ith row. Since the column rank of a
matrix is equal to the row rank, the $\vec{v}_i$ will be linearly independent if and only if the
columns of M are. By Lemma 5.15, the columns of M are linearly independent. Hence,
$U_{jk}$ is invertible.

Q.E.D.

The  next  corollary  is  the  main  result  of  this  section. 

Corollary 5.19. Let P be an n x n permutation matrix. Minimal variable oblique elimination
is successful for $PF_n$; that is, there exist unit lower bidiagonal matrices
$L_1, L_2, \ldots, L_{n-2}$, unit upper bidiagonal matrices $U_1, U_2, \ldots, U_{n-2}$, and bidiagonal
matrices B and C such that

\[
PF_n = L_1 L_2 \cdots L_{n-2}\, B\, C\, U_{n-2} U_{n-3} \cdots U_1.
\]

Proof.

By Theorem 5.18, for every $j \in [1,n]$ and $k \in [1,n-j+1]$, the matrices $L_{jk}$ and $U_{jk}$ are
invertible. By Theorem 5.13, this implies that minimal variable oblique elimination is
successful for $PF_n$.

Q.E.D. 

The next theorem is of interest for implementation purposes.

Theorem 5.20. For every n > 2, the matrices B and C in Corollary 5.19 can be chosen
such that $U_j = L_j^t$ for $j \in [1,n-2]$, that is,

\[
F_n = L_1 L_2 \cdots L_{n-2}\, B\, C\, L_{n-2}^t L_{n-3}^t \cdots L_1^t.
\]

Proof.

Let $F_n = LU$ be the LU decomposition of $F_n$. Since $F_n$ is symmetric, $U = DL^t$
where D is a diagonal matrix [15]. Suppose that $L = L_1 L_2 \cdots L_{n-2} B$ is the tridiagonal
decomposition of L computed using minimal variable oblique elimination. Then $U^t = LD
= L_1 L_2 \cdots L_{n-2} B D$. Take $C = DB^t$.

Q.E.D.

We have shown that minimal variable oblique elimination can be used to compute
tridiagonal decompositions of the Fourier matrices. The decompositions need to be
computed only once for each n. The factors of the decomposition can then be stored
permanently. Thus, a library of parallel algorithms for computing discrete Fourier
transforms of any size (within some upper bound) should be easy to construct using this
technique. The same program can be used for any n, which is far from the case with
FFT-based methods.

5.4.  Complexity  Considerations 

In this section we derive expressions for the time complexity of the parallel and
serial algorithms resulting from the decompositions developed in this chapter. We also
establish upper bounds on the number of floating point operations required to compute
the decompositions using minimal variable oblique elimination. Recall that the
algorithms resulting from the decompositions are generally required to be real-time
algorithms whereas the computation of the tridiagonal factors need not be.

Theorem 5.21. Let M be an n x n matrix and assume that minimal variable oblique
elimination is successful for M. It takes 2n-2 parallel steps, all steps but one consisting of, at
most, one multiplication and one addition per point and the other step consisting of at
most two multiplications and one addition per point, to implement the transformation
$\vec{v} \to M\vec{v}$ using the tridiagonal decomposition of M.

Proof.

The result follows immediately from Corollary 5.19.

Q.E.D.

We shall use the concept of a flop (floating point operation) to quantify the
complexity of a serial computation. Golub and van Loan define a flop to be "more or less the
work associated with the statement $s := s + a_{ik} b_{kj}$" [15]. We take $s$, $a_{ik}$ and $b_{kj}$ to be
complex numbers. Note that it takes at least three real multiplications and five
additions or four real multiplications and two additions to perform a complex multiplication
[6].

Theorem 5.22. Let M be an n x n matrix and assume that minimal variable oblique
elimination is successful for M. The mapping $\vec{v} \to M\vec{v}$ can be accomplished with $n^2 + 2$ flops
using the decomposition resulting from oblique elimination.

Proof.

If $\vec{x} \in C^n$ and $\vec{y} = L_i\vec{x}$ for some $i \in [1,n-2]$, then, by the way the $L_i$ are constructed,
it takes one flop to compute $y_k$ if $k \ge n-i+1$, and zero otherwise. The same statement
holds true for the $U_i$. If $\vec{y} = C\vec{x}$, then it takes two flops to compute $y_k$ for $k \in [1,n]$. If
$\vec{y} = B\vec{x}$, then, since B is unit lower bidiagonal, it takes one flop to compute $y_k$ for
$k \in [1,n]$. Hence, it takes a total of

\[
2\sum_{i=1}^{n-2} i + 3n = n^2 + 2
\]

flops.

Q.E.D.
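
The count in the proof is easy to confirm by tallying the cost of each factor. The sketch below only mirrors the bookkeeping of the proof; the function name `serial_flops` is hypothetical.

```python
def serial_flops(n):
    """Flops for v -> Mv using the factors L_1..L_{n-2}, B, C, U_{n-2}..U_1, counted as in the proof."""
    lower_factors = sum(i for i in range(1, n - 1))   # L_i changes i entries of the vector
    upper_factors = sum(i for i in range(1, n - 1))   # the same count holds for the U_i
    return lower_factors + upper_factors + 2 * n + n  # C costs 2n flops, B costs n flops


for n in (3, 4, 8, 16, 64):
    print(n, serial_flops(n), n * n + 2)
```

Both columns agree for every n tried, and the total matches, up to the additive constant, the n squared flops of a direct matrix-vector product.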

The operation count associated with using the tridiagonal decompositions to
implement linear transforms on a serial machine is essentially the same as the operation count
of the straightforward method. Therefore, these decompositions do not lead to good
serial algorithms for computing linear transforms in general.

We now show that if minimal variable oblique elimination is successful, then it
takes $O(n^3)$ flops to compute the decomposition. The computations required to compute
the decompositions can be divided into three categories. One category consists of the
computations necessary to compute the LU decomposition. The other categories consist
of the computations necessary to construct the minimal variable solution matrices and
the matrix multiplications necessary to compute the intermediate results, that is, the
multiplications of the form $L_i^{-1}(L_{i-1}^{-1} \cdots L_1^{-1} L)$. The first category is well studied. We
examine the latter two categories.

We  first  consider  the  computations  necessary  to  construct  the  minimal  variable 
solution  matrices.  In  the  next  theorem,  we  establish  an  upper  bound  on  the  number  of 
flops  required  to  compute  the  minimal  variable  solution  matrix  for  a  lower  triangular, 
banded  matrix  A  assuming  that  a  Horner  type  algorithm  is  used. 

Theorem 5.23. Let $i \in [1,n-2]$ and let A be an n x n matrix with lower bandwidth n-i+1.
For each $k \in [0,i-1]$ assume that $x_{n-i+k}$ is computed by first computing $d_{n-i+k}$ in the nested
fashion

\[
d_{n-i+k} = x_{n-i+k-1} \bigl( x_{n-i+k-2} \bigl( \cdots \bigl( x_{n-i+1} \bigl( x_{n-i}\, a_{n-i,k+1} + a_{n-i+1,k+1} \bigr) + a_{n-i+2,k+1} \bigr) + \cdots \bigr) + a_{n-i+k-1,k+1} \bigr) + a_{n-i+k,k+1}
\]

and then computing

\[
x_{n-i+k} = -\frac{a_{n-i+k+1,k+1}}{d_{n-i+k}}.
\]

It takes $\frac{i(i+1)}{2}$ flops to compute the minimal variable solution for A.

Proof.

It takes k flops to compute $d_{n-i+k}$, one for each $x_{n-i+j}$, j = 0, 1, ..., k-1. It takes
one more division to compute $x_{n-i+k}$. Hence, it takes

\[
\sum_{k=0}^{i-1} (k+1) = \frac{i(i+1)}{2}
\]

flops altogether.

Q.E.D.

Corollary 5.24. Let M be an n x n matrix and assume that minimal variable oblique
elimination is successful for M. Using the method of Theorem 5.23, it takes $\frac{1}{3}n(n-1)(n-2) =
O(n^3)$ flops to construct the minimal variable solution matrices required to factor M.

Proof.

The solution matrices must be computed for $A_1 = L_1^{-1}L$, $A_2 = L_2^{-1}L_1^{-1}L$, ...,
$A_{n-2} = L_{n-2}^{-1} \cdots L_1^{-1}L$. It takes $\frac{i(i+1)}{2}$ flops to do so for each $A_i$. The same must be
done for $U^t$. Hence, it takes

\[
2\sum_{i=1}^{n-2} \frac{i(i+1)}{2} = \sum_{i=1}^{n-2} i^2 + \sum_{i=1}^{n-2} i = \frac{1}{3}n(n-1)(n-2)
\]

flops altogether.

Q.E.D.
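
We now derive upper bounds on the number of flops required to carry out the
matrix multiplications needed to construct the intermediate results. Recall that if
$i \in [1,n-2]$, then

\[
L_i^{-1} = \begin{pmatrix}
I_{n-i-1} & & & & & \\
 & 1 & & & & \\
 & x_{n-i} & 1 & & & \\
 & x_{n-i} x_{n-i+1} & x_{n-i+1} & 1 & & \\
 & \vdots & \vdots & & \ddots & \\
 & \prod_{j=0}^{i-1} x_{n-i+j} & \prod_{j=1}^{i-1} x_{n-i+j} & \cdots & x_{n-1} & 1
\end{pmatrix},
\]

where the entries not shown are zero.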
Theorem 5.25. Let $i \in [1, n-2]$ and let $A = L_{i-1}^{-1} \cdots L_1^{-1}L$. Let $R_k$ denote the kth row
of A. Assume that the matrix multiplication $L_i^{-1}A$ is computed in the following nested fashion:

$$R_{n-i+j} \rightarrow x_{n-i+j-1}\bigl(x_{n-i+j-2}\bigl(\cdots x_{n-i+1}\bigl(x_{n-i}R_{n-i} + R_{n-i+1}\bigr) + R_{n-i+2}\bigr) + \cdots + R_{n-i+j-1}\bigr) + R_{n-i+j}$$

for $j \in [1, i]$. Then the matrix multiplication can be carried out using i(n-i+2) flops.

Proof

The multiplication leaves rows 1 to n-i unchanged. The computation of the
(n-i+j)th row is of the form $R_{n-i+j} \rightarrow x_{n-i+j-1}E + R_{n-i+j}$ where E is the new value of
row n-i+j-1. Therefore, it takes one flop per row element to compute the new row.
Since A is lower triangular with lower bandwidth n-i+1, there are at most n-i+1 nonzero
elements in each row. Furthermore, the nonzero part of each row is offset by one location
from the nonzero part of the row directly above and below it. Hence, the linear
combination needs to be carried out for only n-i+2 elements in each row. Since only i
rows are changed, the matrix multiplication can be carried out in i(n-i+2) flops.

Q.E.D. 
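
The nested row update of Theorem 5.25 can be sketched as follows. This is a Python illustration rather than the appendix's FORTRAN 77; the name `apply_elimination_rows` and the dictionary `x` are assumptions made for the example. For simplicity the sketch updates whole rows instead of only the n-i+2 entries that can change, so it illustrates the recurrence but not the exact flop count.

```python
def apply_elimination_rows(a, x, i):
    """Return the product L_i^{-1} a using the nested row update.

    a : n x n matrix as a list of rows
    x : dict mapping the 1-based index r to x_r, for r = n-i, ..., n-1
    Rows 1..n-i are left unchanged.  Each later row r becomes
    x_{r-1} * (new row r-1) + (old row r): one multiply-add per entry.
    """
    n = len(a)
    out = [row[:] for row in a]
    for r in range(n - i + 1, n + 1):          # 1-based rows n-i+1, ..., n
        previous_new = out[r - 2]              # already updated (or unchanged) row r-1
        current_old = a[r - 1]
        out[r - 1] = [x[r - 1] * p + c for p, c in zip(previous_new, current_old)]
    return out
```

Because each new row reuses the row just computed, the explicit products of the $x$'s that appear in $L_i^{-1}$ are never formed.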
Corollary 5.26. Let M be an n x n matrix and assume that minimal variable oblique
elimination is successful for M. Let M = LU be the LU decomposition of M. The matrix
multiplications necessary to compute the minimal variable tridiagonal decompositions of L
and U can be carried out using $\frac{1}{3}(n-1)(n-2)(n+9)$ flops.

Proof.

The multiplications $L_1^{-1}L$, $L_2^{-1}(L_1^{-1}L)$, ... , and $L_{n-2}^{-1}(L_{n-3}^{-1} \cdots L_1^{-1}L)$, each of which
can be done using i(n-i+2) flops, must be computed. The same must be done for $U^t$.
Hence it takes a total of

$$2\sum_{i=1}^{n-2} i(n-i+2) = \frac{1}{3}(n-1)(n-2)(n+9)$$

flops.

Q.E.D.

We now combine the results obtained in this chapter to point out that oblique
elimination is, at most, an $O(n^3)$ operation.

Corollary 5.27. If minimal variable oblique elimination is successful for the n x n matrix
M, then it can be carried out using $O(n^3)$ flops.

Proof.

By Corollaries 5.24 and 5.26, it takes at most $O(n^3)$ flops to compute the tridiagonal
decompositions of L and U. Standard algorithms can be used to compute the LU
decomposition in $O(n^3)$ flops. Thus, the whole procedure takes $O(n^3)$ flops.

Q.E.D. 
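
The closed forms used in Corollaries 5.24 and 5.26 can be checked numerically. The following Python sketch (function names are illustrative assumptions) sums the per-step counts i(i+1)/2 and i(n-i+2) over i = 1, ..., n-2, doubles them to account for both L and U, and compares the totals with the stated expressions.

```python
def solution_flops(n):
    """Flops to construct all minimal variable solutions for L and for U."""
    return 2 * sum(i * (i + 1) // 2 for i in range(1, n - 1))


def multiplication_flops(n):
    """Flops for all matrix multiplications, at i(n-i+2) flops for step i."""
    return 2 * sum(i * (n - i + 2) for i in range(1, n - 1))


def closed_forms_hold(n):
    """Compare both sums with n(n-1)(n-2)/3 and (n-1)(n-2)(n+9)/3."""
    return (3 * solution_flops(n) == n * (n - 1) * (n - 2)
            and 3 * multiplication_flops(n) == (n - 1) * (n - 2) * (n + 9))
```

Both totals grow like $\frac{1}{3}n^3$, which is the source of the cubic bound in Corollary 5.27.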


In fact, each of the separate operation counts is of the form $\frac{1}{3}n^3 + O(n^2)$, so the total
operation count is of the form $n^3 + O(n^2)$.

We have described minimal variable oblique elimination and shown that it can be
used to compute tridiagonal decompositions of the Fourier matrices. Computing a
one-dimensional DFT on a linearly connected array of processors requires 2n-2 parallel
arithmetic steps and at most n permutation steps for every n. Moreover, since the
permutation steps all occur together at the end of the computation, it is possible
that they can be done some other way.
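
The count of 2n-2 arithmetic steps can be read off from the factorization used in the appendix, assuming that each bidiagonal factor is applied in one parallel step:

```latex
F_n = P\,Y_1 Y_2 \cdots Y_{n-2}\,B\,C\,X_{n-2} X_{n-3} \cdots X_1,
\qquad
\underbrace{(n-2)}_{X_1,\ldots,X_{n-2}}
+ \underbrace{1}_{C}
+ \underbrace{1}_{B}
+ \underbrace{(n-2)}_{Y_1,\ldots,Y_{n-2}}
= 2n-2 .
```

The permutation P is applied last and accounts for the additional permutation steps.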


CONCLUSIONS  AND  SUGGESTIONS  FOR  FURTHER  RESEARCH 

We  have  demonstrated  that  image  algebra  is  useful  as  a  model  of  parallel  image 
processing  and  as  a  tool  for  the  development  of  parallel  algorithms  for  computing  linear 
image  to  image  transforms.  We  have  shown  that  the  algebraic  relationships  resulting 
from  the  establishment  of  an  algebraic  structure  for  digital  image  processing  can  yield 
useful  results  and  techniques. 

More  specifically,  we  have  shown  that 

1.)  The  image  algebra  provides  an  alternative  algebraic  formulation  of  linear  image 
processing  that  reflects  contemporary  computing  environments. 

2.) Local decompositions of linear transforms exist in all "reasonable" cases; that is,
we have shown that, given a network of processors interconnected by some network of
communications links, any linear transformation can be factored into a sequence of
linear transformations, each of which is local with respect to the network, if and only
if there is a path between every pair of processors in the network.

3.) The image algebra can serve as an algebraic model for linear computations on
computer networks based on group structures, Cayley networks, and provides a link
between these networks and existing algebraic structures. We have characterized those
linear transformations which are translation invariant with respect to a Cayley network
and shown that the set of all such transformations is an algebra which is isomorphic to
an existing algebraic object called the group algebra.

4.) Research in developing computer programs for factoring multivariable polynomials
has applications to parallel image processing. In particular, convolutions can be
decomposed into smaller convolutions by factoring polynomials corresponding to the
images to be convolved.

5.)  It  is  feasible  to  implement  discrete  Fourier  transforms  of  sizes  other  than  powers 
of  two  on  mesh-connected  arrays.  We  have  developed  algorithms  and  written  programs 
to  show  how  this  can  be  done.  We  have  provided  estimates  on  the  number  of  steps 
required  to  implement  such  algorithms  on  mesh-connected  arrays. 

6.) Numerical methods can be used to develop alternative methods for computing
small discrete Fourier transforms locally, particularly on processors that are designed to
implement floating point operations quickly. We have shown how this can be done and
have provided FORTRAN programs that apply these numerical methods to the
Fourier matrices. One program computes the coefficients required to implement DFTs
locally using this technique; the other uses these coefficients to compute DFTs locally
for arbitrary values of n less than a certain upper bound. These algorithms take 2n-2
parallel steps to compute the DFT of length n for any n less than this upper bound and,
at most, n elementary parallel permutation steps.

The existence theorem in Chapter 2 (Theorem 2.16) and the use of relationships
between image algebra and other algebraic structures in developing local decompositions
suggest that it may be possible, at some point in the near future, to build an image
algebra compiler that is capable of decomposing classes of templates into products of local
templates. The existence of such a compiler would ease the burden of developing parallel
algorithms for programmers. On a smaller scale, the results presented in this dissertation
would seem to imply that researchers should have the ability to develop computer
programs that would use algebraic methods to decompose computations into forms
compatible with parallel architectures in the not too distant future. More research on
general methods would need to be done for these things to become a reality.

We  now  list  some  suggestions  for  further  research.  Many  of  these  suggestions  are 
questions  about  which  the  author  is  curious.  Therefore,  if  results  are  obtained  that  are 
related  to  the  following  problems,  it  would  be  appreciated  if  the  author  were  made  aware 
of  them.  We  have  attempted  to  present  the  problems  in  the  same  order  as  the  material 
in  the  main  body  of  the  dissertation  on  which  they  are  based.  There  is  some  overlap 
however. 

One area of research is the generalization of the results in this dissertation to the
$\boxed{\vee}$ and $\textcircled{\vee}$ operations. Can an analogue of the existence theorem of Chapter 2 be
formulated and proved? Miller [31] represents the Minkowski operations by convolution
operations. The operations $\boxed{\vee}$ and $\textcircled{\vee}$ are both related to the Minkowski operations. Can
analogues of circulant templates be defined for $\boxed{\vee}$ and/or $\textcircled{\vee}$ using Miller's work, or
group algebras, as a guide? If so, is there an analogue of the discrete Fourier transform?
Of the FFT? Given affirmative answers to these questions, are there methods analogous
to those given in Chapter 4 for deriving local decompositions of templates with respect
to $\boxed{\vee}$ and $\textcircled{\vee}$?

It has been shown that the algebra generated by inversion and composition of Toeplitz
integral operators is dense in the space of arbitrary kernels and, therefore, that if
K(t,u) is an arbitrary p x p kernel matrix, then the mapping $f \rightarrow g$ defined by

$$g(t) = \int_0^T K(t,u)\,f(u)\,du, \qquad 0 \le t \le T,$$

can be approximated using FFTs or fast convolution algorithms [23]. Interpret this
result in the setting of image algebra. Use it and any related results to develop efficient
methods for approximating arbitrary linear image to image transforms by fast, local
transforms. Investigate methods applicable to special classes of linear transforms.
Recall that Schwartz has deemed this a problem of great interest in robot vision [50].
Find out which integral operators are of special interest in this application and develop
methods for approximating them using the above result.

Develop  local  decompositions  of  Fourier  matrices  and/or  circulant  templates  using 
Winograd's  FFT  algorithm  [6,  66].  Parlett  has  expressed  this  FFT  in  terms  of 
eigenvalue-eigenvector  decompositions  of  circulant  matrices  [38].  Perhaps  this  fact  can 
be  of  use. 

Generalize the methods used in Chapter 4 to other families of matrices that are
somehow "built up" using Kronecker products. The closest relative to the discrete
Fourier transform having this property is the Hadamard transform. Hadamard matrices
having dimensions a power of two can be defined in terms of Kronecker products similar
to the DFT. In fact, they can be considered to be multidimensional DFTs of length two
in each dimension. In any event, it is immediate that these matrices have local
decompositions similar to the DFT in the radix-two case. What happens in the general case? Is
there an analogue to the Rader prime algorithm for Hadamard matrices? Answer these
questions for more general families. From a different perspective, develop efficient
methods for determining if a given matrix can be factored into a Kronecker product of
two matrices, and, if so, for computing the decomposition. Such a decomposition would
reduce the amount of computation required to implement the linear transform
corresponding to that matrix.
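
As a small illustration of the Kronecker product construction mentioned above, the following Python sketch builds the Hadamard matrix of order $2^k$ as a k-fold Kronecker power of the two-point Fourier matrix $F_2$. The helper names `kron` and `hadamard` are assumptions made for the example and do not come from the programs in the appendix.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


# The two-point Fourier matrix, which is also the Hadamard matrix of order 2.
F2 = [[1, 1], [1, -1]]


def hadamard(k):
    """Hadamard matrix of order 2**k as a k-fold Kronecker power of F2."""
    h = [[1]]
    for _ in range(k):
        h = kron(h, F2)
    return h
```

A matrix H of order n built this way satisfies $HH^t = nI$, which gives a quick consistency check on the construction.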

Expand on the results in Chapter 5. The existence theorems of Chapter 2 imply
that every square matrix can be factored into a product of tridiagonal matrices. Use
methods other than the minimal variable method to solve the system of nonlinear
equations that need to be satisfied for oblique elimination to be successful. Characterize the
class, or classes, of matrices for which these methods will be successful. The method
developed by Tchuente will not work on every matrix and the minimal variable oblique
algorithm is even more restrictive. Moreover, tridiagonal decompositions obtained using
oblique elimination of any sort consist of bidiagonal rather than tridiagonal factors.
Develop entirely different methods for factoring square matrices into products of
tridiagonals. We know of no other work in the area, and communication with experts in the
field of numerical linear algebra, such as Professor G. H. Golub of Stanford University,
suggests that the problem of factoring arbitrary square matrices into products of
tridiagonal matrices is as yet unsolved. More generally, develop methods for factoring linear
transformations into products of local transformations relative to other configurations
that model parallel computer architectures. Again, by the existence theorem of Chapter
2, this can be done for a large class of configurations. Use the concepts of Fourier
analysis and synthesis in group representation theory to develop methods for factoring
G-templates into products of templates local with respect to Cayley networks.


APPENDIX 
COMPUTER  PROGRAMS 


In this appendix, we present computer programs used to implement some local
algorithms for computing DFTs using local decompositions of Fourier matrices derived in
Chapters 4 and 5. We include these programs for two reasons. One reason is so that the
programs can be duplicated by anyone who may be interested in seeing the algorithms
work; the other is to illustrate the use of the image algebra presented in this
dissertation as a basis for an algebraically based, parallel programming language. These
programs were written and run on a VAX 11/750 running the 4.2 BSD UNIX operating
system. The first two programs are written in FORTRAN 77 and are used to compute
minimal variable decompositions of Fourier matrices and to implement the parallel
algorithms resulting from these decompositions. We then describe two programs which were
written using an extension of FORTRAN 77 which provides for the use of image algebra
operands and operators. One of these programs will compute the DFT of a 100 x 100
image using the FFT-based method whereas the other is a hybrid program using both
the FFT-based method and oblique elimination. We point out that, since these
programs were run on a serial machine, they are not truly parallel programs. The programs
written using the extended image algebra FORTRAN are written in parallel fashion since
they use the image algebra. The program for computing tridiagonal decompositions of
Fourier matrices by minimal variable oblique elimination is not meant to be parallel.
The program for implementing the parallel algorithms resulting from these
decompositions is written in a serial fashion but can easily be made parallel. It illustrates the ease
with which a program for computing the DFT of sequences of arbitrary length locally
can be written. The hybrid program mentioned earlier contains a segment of code which
is a parallel implementation of the local algorithm for computing 5-point DFTs resulting
from minimal variable oblique elimination.

FORTRAN  Programs  for  Minimal  Variable  Oblique  Elimination 

In this section, we present two programs, coeffgen.f and oem.f. The program
coeffgen.f takes as input an integer n > 1 and generates the coefficients and permutation
information corresponding to the minimal variable decomposition of $F_n$. A description of
the program steps follows:

1.) Input n.

2.) Generate $F_n$.

3.) Compute the LU decomposition of $PF_n$ for some permutation matrix P using Gaussian
elimination with partial pivoting; that is, compute the decomposition $F_n = PLU$
where P is a permutation matrix, L is a unit lower triangular matrix, and U is an upper
triangular matrix.

4.) Use minimal variable oblique elimination to compute the decompositions

$$L = Y_1Y_2 \cdots Y_{n-2}B$$

and

$$U = CX_{n-2}X_{n-3} \cdots X_1$$

where, for $i \in [1, n-2]$, the matrices $Y_i$ and $X_i$ are unit lower and upper bidiagonal,
respectively, and B and C are lower and upper bidiagonal, respectively. In the program, the
sub-diagonal elements of the matrices $Y_1, Y_2, \ldots, Y_{n-2}$ are stored in the first n-2
rows of the two-dimensional array Y(i,j) and the sub-diagonal and diagonal elements of
the matrix B are stored in the last two rows of Y(i,j). Similarly, the super-diagonal
elements of the matrices $X_1, X_2, \ldots, X_{n-2}$ are stored in the first n-2 rows of the
two-dimensional array X(i,j) and the super-diagonal and diagonal elements of the matrix C
are stored in the last two rows of X(i,j). These are the coefficients that we refer to and
they are written to the files xcoeffn and ycoeffn.

5.) Generate a table oempermn containing information about the permutation used
to implement the partial pivoting. An array R is constructed which contains this
information and which is written to the file oempermn. The array R is constructed according
to the following rule:

do i=1,n

If row i is switched with row j during the partial pivoting, then R(i)=j.

Thus, R contains a sequential record of the transpositions used to implement partial
pivoting. The permutation matrix P in 3.) represents the product of the inverses of these
transpositions and is implemented accordingly. For example, suppose n = 5 and first
row 2 is switched with row 3 and then row 3 is switched with row 4 and the other rows
remain fixed. Then P represents the permutation $\sigma = ((23)(34))^{-1} = (34)(23)$. Thus, P
can be implemented by starting through an array of data of length 5 at the fifth location
and sequentially making the transpositions as necessary. Since we are using a serial
machine, this is the method that we use rather than the OETS, which is very time
consuming on such a machine.
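
The record R and the reverse sweep described above can be sketched as follows. The sketch is in Python rather than FORTRAN 77, and the function names are illustrative assumptions. Index 0 of the list is unused so that R[i] matches the 1-based R(i) of the text.

```python
def record_pivots(swaps, n):
    """Build R with R[i] = j if row i was switched with row j, else R[i] = i."""
    r = list(range(n + 1))          # r[0] is unused padding
    for i, j in swaps:
        r[i] = j
    return r


def apply_forward(data, r):
    """Apply the recorded transpositions in the order they were made."""
    v = list(data)
    for i in range(1, len(r)):
        if r[i] != i:
            v[i - 1], v[r[i] - 1] = v[r[i] - 1], v[i - 1]
    return v


def apply_inverse(data, r):
    """Undo the transpositions by sweeping from the last location to the first."""
    v = list(data)
    for i in range(len(r) - 1, 0, -1):
        if r[i] != i:
            v[i - 1], v[r[i] - 1] = v[r[i] - 1], v[i - 1]
    return v
```

For the example in the text, `record_pivots([(2, 3), (3, 4)], 5)` records R(2)=3 and R(3)=4, and the reverse sweep applies (34) first and then (23).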

The  program  coeffgen.f  follows: 


C     Program coeffgen.f
      complex A(100,100),B(100,100),C(100,100)
      complex X(100,100),Y(100,100)
      integer R(100)
      complex wn,z,d
      character yfile*9,xfile*9,blocksize*3,permfile*10
C
C     This program will generate the coefficients necessary to
C     compute the one-dimensional DFT of blocksize n locally
C     using oblique elimination. It will also generate the permutations.
C     The coefficients are written to the files xcoeffn and
C     ycoeffn and the permutations are written to the file oempermn.
C
      print *,"Enter the character representation of the blocksize,"
      print *,"that is, enter the blocksize enclosed in quotes."
      read *,blocksize
      xfile='xcoeff'//blocksize
      yfile='ycoeff'//blocksize
      permfile='oemperm'//blocksize
      print *,"Enter numerical blocksize."
      read *,n
      print *,"blocksize =",n
C     The Fourier matrix of order n is generated here
      pi=3.14159265
      pn=2*pi/n
      wn=cmplx(cos(pn),-sin(pn))
      do 101 i=1,n
        do 102 j=1,n
          A(i,j)=wn**((i-1)*(j-1))
102     continue
101   continue
C     The LU decomposition of the Fourier matrix is computed here
      do 105 i=1,n-1
        B(i,i)=(1.0,0.0)
C     The partial pivoting is done here.
        nmax=i
        do 201 np=i+1,n
          if (abs(A(np,i)) .gt. abs(A(nmax,i))) then
            nmax=np
          endif
201     continue
        R(i)=nmax
        if (nmax .ne. i) then
          do 202 ns=i,n
            z=A(i,ns)
            A(i,ns)=A(nmax,ns)
            A(nmax,ns)=z
202       continue
        endif
        do 106 k=i+1,n
          if (A(i,i) .eq. (0.0,0.0)) then
            print *," at i= ",i
            print *,"Ran into zero divide using Gaussian Elimination."
            stop
          else
            B(k,i)=A(k,i)/A(i,i)
          endif
          do 107 j=i+1,n
            A(k,j)=A(k,j)-B(k,i)*A(i,j)
107       continue
106     continue
105   continue
      R(n)=n
      C(n,n)=(1.0,0.0)
C     The permutations of the lower triangular required to compensate
C     for the partial pivoting are performed here.
      do 650 i=n-1,1,-1
        do 651 j=i,n
          C(j,i)=B(j,i)
651     continue
        if (R(i) .ne. i) then
          do 652 j=1,n
            z=C(R(i),j)
            C(R(i),j)=C(i,j)
            C(i,j)=z
652       continue
        endif
650   continue
      do 655 i=1,n-1
        if (R(i) .ne. i) then
          do 656 j=1,n
            z=C(R(i),j)
            C(R(i),j)=C(i,j)
            C(i,j)=z
656       continue
        endif
655   continue
C     We set the subdiagonals of the upper triangular factor
C     equal to zero here. This is necessary to avoid errors
C     during the oblique elimination.
      do 150 i=2,n
        do 150 j=1,i-1
          A(i,j)=(0.0,0.0)
150   continue
C     The coefficients are calculated here. That is, the oblique
C     elimination is performed on the upper and lower triangular
C     factors of the Fourier matrix
      do 99 ns=1,2
        do 20 L=1,n-2
          do 5 j=1,n-1-L
            X(L,j)=cmplx(0.0,0.0)
5         continue
          do 10 k=n-L,n-1
            d=C(k,k+1-(n-L))
            j=1
            z=cmplx(1.0,0.0)
15          if (X(L,k-j) .ne. cmplx(0.0,0.0)) then
              z=z*X(L,k-j)
              d=d+C(k-j,k+1-(n-L))*z
C     The increment of j is missing from the scanned listing; it is
C     assumed here, since the loop cannot terminate without it.
              j=j+1
              go to 15
            endif
            if (d .eq. cmplx(0.0,0.0)) then
              print *,"ran into zero divide"
              stop
            endif
            X(L,k)=-C(k+1,k+1-(n-L))/d
10        continue
          call ltmult(X,C,L,n)
20      continue
C     At this point the coefficients of the last tridiag are put in the
C     last two rows of the coefficient array
        X(n-1,1)=C(1,1)
        do 601 j=2,n
          X(n-1,j)=C(j,j)
          X(n,j)=C(j,j-1)
601     continue
        if (ns .eq. 1) then
C     At this point the coefficients are put into another array for
C     storage while the coefficients of the upper tridiagonal are computed
          do 603 i=1,n
            do 602 j=1,n
              Y(i,j)=X(i,j)
602         continue
603       continue
C     The upper triangular matrix is taken out of storage so that it
C     can be factored
          do 501 i=1,n
            do 502 j=1,n
              C(i,j)=A(j,i)
502         continue
501       continue
        endif
99    continue
C     The coefficients are written to a file here.
      open(15,file=xfile,status='new')
      do 902 i=1,n
        do 901 j=1,n
          write(15,*) X(i,j)
901     continue
902   continue
      close(15)
      open(15,file=yfile,status='new')
      do 904 i=1,n
        do 903 j=1,n
          write(15,*) Y(i,j)
903     continue
904   continue
      close(15)
      open(15,file=permfile,status='new')
      do 950 i=1,n
950   write(15,*) R(i)
      close(15)
      end
C     *****************************************************************
      subroutine ltmult(X,B,L,n)
      complex X(100,100),B(100,100),C(100,100)
      complex y
C     *****************************************************************
C
C     This subroutine performs the matrix multiplication necessary to
C     eliminate the L th oblique of the lower banded n x n matrix B.
C     B has zeroes in the 1,2,...,L-1 lower obliques on input and
C     zeroes in the 1,2,...,L lower obliques on output. The coefficients
C     required to perform this elimination are stored in the array X
C     and have been computed in the main program.
C
C     *****************************************************************
      do 25 i=1,n
        do 26 j=1,n
          C(i,j)=(0.0,0.0)
26      continue
25    continue
      C(1,1)=B(1,1)
      do 30 i=2,n
        y=(1.0,0.0)
        do 35 k=i,2,-1
          if (k .ne. i) then
            y=y*X(L,k)
          endif
          do 40 j=1,i
            C(i,j)=C(i,j)+y*B(k,j)
40        continue
35      continue
30    continue
      do 50 ii=1,n
        do 55 jj=1,n
          B(ii,jj)=C(ii,jj)
55      continue
50    continue

      end

The program oem.f uses the information generated by coeffgen.f to compute the
one-dimensional DFT of a sequence of length n locally. The following steps are executed
by the program:

1.) Input n, filename of file containing input sequence, and filename of file to write
output sequence to,

2.) Read from coefficient and permutation files generated by coeffgen.f,

3.) Read input sequence v,

4.) Perform the following sequence of calculations:

do i=1,n-2
$v \leftarrow X_i v$,

5.) Perform $v \leftarrow Cv$,

6.) Perform $v \leftarrow Bv$,

7.) Perform the following sequence of calculations:

do i=n-2,1
$v \leftarrow Y_i v$,

8.) Perform $v \leftarrow Pv$,

9.) Write output sequence to a file,

where $F_n = PY_1Y_2 \cdots Y_{n-2}BCX_{n-2}X_{n-3} \cdots X_1$ is the minimal variable decomposition of $F_n$.
The program oem.f follows:

C     Program oem.f
      complex X(100,100),Y(100,100),V(100),W(100)
      integer R(100)
      complex wn,z
      character xfile*9,yfile*9,permfile*10,blocksize*3
      character output*10,input*10
C     This program will compute an n-point DFT locally using
C     the decompositions computed using oblique elimination.
C     The program reads the coefficients from the files xcoeffn
C     and ycoeffn and the permutations from the file oempermn.
C     It reads the input sequence from the user defined file
C     input and writes the output sequence to the user defined
C     file output.
C     The coefficient and permutation input filenames are constructed here.
      print *,"Enter the character representation of the blocksize,"
      print *,"that is, enter the blocksize enclosed in quotes."
      read *,blocksize
      xfile='xcoeff'//blocksize
      yfile='ycoeff'//blocksize
      permfile='oemperm'//blocksize
      print *,"Enter the numerical representation of the blocksize."
      read *,n
C     The input and output filenames are input here.
      print *,"Enter the input filename (enclosed in quotes)."
      read *,input
      print *,"Enter the output filename (enclosed in quotes)."
      print *,"It must be a new file."
      read *,output
C     The coefficients and permutation information are input here.
      open(15,file=xfile,status='old')
      do 1 i=1,n
        do 2 j=1,n
          read(15,*) X(i,j)
2       continue
1     continue
      close(15)
      open(15,file=yfile,status='old')
      do 3 i=1,n
        do 4 j=1,n
          read(15,*) Y(i,j)
4       continue
3     continue
      close(15)
      open(15,file=permfile,status='old')
      do 5 i=1,n
5     read(15,*) R(i)
      close(15)
C     The input sequence is obtained here.
      open(15,file=input,status='old')
      do 6 i=1,n
6     read(15,*) V(i)
      close(15)
C     The DFT of the array V is now computed locally using the
C     coefficients derived from the oblique elimination process.
      do 7 i=1,n-2
        do 8 j=1,n-1
          W(j)=V(j)-X(i,j)*V(j+1)
8       continue
        do 9 j=1,n-1
          V(j)=W(j)
9       continue
7     continue
      do 10 j=1,n-1
        W(j)=X(n-1,j)*V(j)+X(n,j+1)*V(j+1)
10    continue
      V(n)=X(n-1,n)*V(n)
      do 11 j=1,n-1
        V(j)=W(j)
11    continue
      do 12 j=n,2,-1
        W(j)=Y(n,j)*V(j-1)+Y(n-1,j)*V(j)
12    continue
      do 13 j=2,n
        V(j)=W(j)
13    continue
      V(1)=Y(n-1,1)*V(1)
      do 14 i=n-2,1,-1
        do 15 j=n,2,-1
          W(j)=V(j)-Y(i,j-1)*V(j-1)
15      continue
        do 16 j=1,n
          V(j)=W(j)
16      continue
14    continue
      do 17 j=1,n
        V(j)=(1/cmplx(n))*V(j)
17    continue
C     The permutations corresponding to the partial
C     pivots are executed here. Since the permutations are generated
C     from the partial pivoting process, they are represented as a
C     sequence of transpositions corresponding to the switching
C     of the rows.
      do 18 i=n,1,-1
        if (R(i) .ne. i) then
          z=V(i)
          V(i)=V(R(i))
          V(R(i))=z
        endif
18    continue
C     The output sequence is now written to a file.
      open(15,file=output,status='new')
      do 19 i=1,n
19    write(15,*) V(i)
      close(15)
      end


These programs have been run and tested against various existing DFT and FFT
programs. The results indicate that round-off error becomes increasingly significant when
computing minimal variable decompositions of Fourier matrices. When n gets close to
30, the computations "blow up". Use of double precision alleviates the problem somewhat.
A detailed analysis of this situation is needed. We point out that the decompositions
only need to be computed once. Thus, it is not unreasonable to consider developing
a very precise, but possibly slow, algorithm for generating the coefficients required for
implementing the DFT, or some other particular linear transform, locally using minimal
variable oblique elimination.
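
The local character of the arithmetic steps in oem.f can be seen in the following Python sketch. It mirrors the two update rules used in that program, W(j)=V(j)-X(i,j)*V(j+1) and W(j)=V(j)-Y(i,j-1)*V(j-1), with 0-based lists. The function names are illustrative assumptions. Each output value depends only on the value at the same location and at one neighbouring location, so on a linearly connected array all locations could be updated in a single parallel step.

```python
def upper_step(sup, v):
    """One step v <- X_i v for a unit upper bidiagonal factor.

    sup[j] is the stored coefficient for location j; the factor has
    -sup[j] on its super-diagonal, as in the oem.f update.
    """
    n = len(v)
    return [v[j] - sup[j] * v[j + 1] for j in range(n - 1)] + [v[n - 1]]


def lower_step(sub, v):
    """One step v <- Y_i v for a unit lower bidiagonal factor.

    sub[j-1] is the stored coefficient coupling locations j-1 and j; the
    factor has -sub[j-1] on its sub-diagonal.
    """
    n = len(v)
    return [v[0]] + [v[j] - sub[j - 1] * v[j - 1] for j in range(1, n)]
```

Both functions read only the old vector and build a new one, which plays the role of the temporary array W in the FORTRAN program.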

Image Algebra Programs for the 100 x 100 DFT

The next two programs that we describe are written in an extended FORTRAN 77
code made available by an image algebra preprocessor. The image algebra preprocessor
is a FORTRAN 77 program which allows for programs to be written using image algebra
operands and operators. The operands and operators of the image algebra are
representable in an extended FORTRAN 77 code which is input to the preprocessor. The
preprocessor scans the code for special symbols corresponding to image algebra objects. The
lines containing image algebra statements are replaced by equivalent blocks of
FORTRAN 77 code. The result is a FORTRAN 77 program which can be compiled and run
in the usual way. The preprocessor is for experimental use only and is a straightforward
implementation with few special features or optimization techniques included; we
describe it only as much as is needed to understand the code that we present.

Images are limited to two-dimensional arrays and are declared as such in the usual
way. Templates are constructed in a fashion similar to subroutines and can be defined
with parameters; in fact, they are implemented as subroutines. They are declared at the
beginning of the main program and can either be defined internally after the end of the
main program or be defined externally and stored in a template library. A
template is defined by defining the weights at some set of points. This set of points is
taken to be the configuration. The operations of + and * between images are
represented by these symbols for the preprocessor. The operation of $\oplus$ between images
and templates is represented by the symbol +. The preprocessor is "aware" of the types
of objects that it is dealing with and interprets the + sign accordingly. As an example,
suppose that $X = \{(x,y) : 0 \le x \le 99,\ 0 \le y \le 99\}$ is a 100 x 100 array and that one
wants to define the templates $(\mathcal{T}, T), (\mathcal{S}, S) \in L_X$ defined by:

If y is even then for every x

$$t_{(x,y)}(u,v) = \begin{cases} 1 & \text{if } (u,v) = (x,y) \\ 1 & \text{if } (u,v) = (x,y+1) \\ 0 & \text{else.} \end{cases}$$

If y is odd then for every x

$$t_{(x,y)}(u,v) = \begin{cases} -1 & \text{if } (u,v) = (x,y) \\ 1 & \text{if } (u,v) = (x,y-1) \\ 0 & \text{else.} \end{cases}$$

If x is even then for every y

$$s_{(x,y)}(u,v) = \begin{cases} 1 & \text{if } (u,v) = (x,y) \\ 1 & \text{if } (u,v) = (x+1,y) \\ 0 & \text{else.} \end{cases}$$

If x is odd then for every y

$$s_{(x,y)}(u,v) = \begin{cases} -1 & \text{if } (u,v) = (x,y) \\ 1 & \text{if } (u,v) = (x-1,y) \\ 0 & \text{else.} \end{cases}$$

The templates T and S are used to implement the transformation corresponding to the
matrix $I_{50} \otimes F_2$ along each row and each column, respectively, of a 100 x 100
image. This type of computation is sometimes called a butterfly computation. An example
preprocessor program in which these templates are combined into one, using a template
with a parameter, follows:

      complex variant bfly
      complex A(100,100),B(100,100)
      A=100
      B=A+bfly(1)
      end

      complex variant template bfly(nrc)
      if (nrc .eq. 1) then
        ybar=mod(y-1,2)
        if (ybar .eq. 0) then
          bfly(x,y)=cmplx(1)
          bfly(x,y+1)=cmplx(1)
        else
          bfly(x,y)=cmplx(-1)
          bfly(x,y-1)=cmplx(1)
        endif
      else
        xbar=mod(x-1,2)
        if (xbar .eq. 0) then
          bfly(x,y)=cmplx(1)
          bfly(x+1,y)=cmplx(1)
        else
          bfly(x,y)=cmplx(-1)
          bfly(x-1,y)=cmplx(1)
        endif
      endif
      end
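As a cross-check on the template definition, the effect of bfly can be sketched outside the preprocessor. The following Python/NumPy fragment is only an illustration and is not part of the original programs. It assumes that the template operation forms, at each pixel, the weighted sum of the image values over the template configuration, and it uses 0-based indices, so the listing's mod(y-1,2) becomes the parity of the index. The function name `bfly` mirrors the template; the other names are hypothetical.

```python
import numpy as np

def bfly(A, nrc):
    """Butterfly template: nrc=1 acts along rows, nrc=2 along columns."""
    A = np.asarray(A, dtype=complex)
    if nrc == 2:
        return bfly(A.T, 1).T
    B = np.empty_like(A)
    # even position: weights 1 at (x,y) and 1 at (x,y+1)
    B[:, 0::2] = A[:, 0::2] + A[:, 1::2]
    # odd position: weights -1 at (x,y) and 1 at (x,y-1)
    B[:, 1::2] = A[:, 0::2] - A[:, 1::2]
    return B

A = np.full((100, 100), 100.0)          # the parallel assignment A=100
B = bfly(A, 1)                          # the statement B=A+bfly(1)

F2 = np.array([[1, 1], [1, -1]])
M = np.kron(np.eye(50), F2)             # the matrix I_50 tensor F_2
```

Under these assumptions, `bfly(A, 1)` agrees with multiplying every row of A by `M`, and `bfly(A, 2)` with multiplying every column by `M`.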

The statement A=100 initializes every location of A to the value 100; that is, it is a
parallel assignment statement. The statement B = A+bfly(1) represents $B = A \oplus T$.
The first image algebra program that we describe, dft100dp.f85, takes as input a
100 x 100 image, A, and produces as output the DFT of A. The DFT is computed using an
image algebra implementation of the twiddle-free FFT-based decompositions of the Fourier
matrices. The addition steps are all implemented locally using $\oplus$ between images
and templates. The multiplication steps are all implemented using image multiplication.
In the interest of saving some time, the permutations are implemented directly rather
than using the OETS.

Specifically, using the notation of Chapter 4, the decomposition implemented is the
following:

$$F_{100} = Q(4,25,3)\,(F_4 \otimes I_{25})\,(I_4 \otimes F_{25})\,T_4(C_{25}^{19})\,P(25,4)$$

$$= P(25,4)\,T_{25}(C_4^{3})\,(I_{25} \otimes F_4)\,P(4,25)\,(I_4 \otimes P(5,5))(I_{20} \otimes F_5)(I_4 \otimes P(5,5))(I_4 \otimes D(5,5))(I_{20} \otimes F_5)(I_4 \otimes P(5,5))\,T_4(C_{25}^{19})\,P(25,4).$$

By Theorem 4.12,

$$F_5 = U_5\,R_{1,5}(2)\,F_4^{(1)}\,\Lambda_5\,F_4^{*(1)}\,R_{2,5}(2)\,U_5^{t}$$

and, by Theorem 4.13,

$$U_5 = V_4 V_3 V_2 V.$$

Furthermore,

$$I_{25} \otimes F_4 = (I_{25} \otimes P(2,2))(I_{50} \otimes F_2)(I_{25} \otimes P(2,2))(I_{25} \otimes D(2,2))(I_{50} \otimes F_2)(I_{25} \otimes P(2,2)).$$

Thus, if we make the substitutions

$$R_1 = R_{1,5}(2),$$
$$R_2 = R_{2,5}(2),$$
$$P_6 = T_4(C_{25}^{19})\,P(25,4),$$
$$P_5 = (I_4 \otimes P(5,5)),$$
$$P_4 = (I_{20} \otimes R_2)(I_{20} \otimes P(2,2)^{(1)}),$$
$$P_3 = (I_{20} \otimes P(2,2)^{(1)})(I_{20} \otimes R_1),$$
$$P_2 = (I_{25} \otimes P(2,2))\,P(4,25)\,(I_4 \otimes P(5,5)),$$

and

$$P_1 = P(25,4)\,T_{25}(C_4^{3})\,(I_{25} \otimes P(2,2)),$$

the program dft100dp.f85 represents the implementation of the sequence of local
transformations indicated in Table A.1 along each row of the input image and then along
each column of the result.
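The factorization of $I_{25} \otimes F_4$ quoted above can be checked numerically. The sketch below is an illustration only: it assumes that P(2,2) exchanges the two middle entries of each block of four (as the subroutine p22 does), that D(2,2) is the diagonal matrix diag(1, 1, 1, -i) (as the image D22 is filled in the main program), and that the DFT kernel is exp(-2*pi*i*jk/N), matching the sign used in lambda5.f and d55.f. All variable names are hypothetical.

```python
import numpy as np

F2 = np.array([[1, 1], [1, -1]], dtype=complex)
P22 = np.eye(4)[[0, 2, 1, 3]]                 # P(2,2): swap the middle entries
D22 = np.diag([1, 1, 1, -1j])                 # D(2,2): twiddle factors
I2F2 = np.kron(np.eye(2), F2)                 # I_2 tensor F_2

# F_4 written as a product of local factors
F4_factored = P22 @ I2F2 @ P22 @ D22 @ I2F2 @ P22

# 4-point DFT matrix built directly from its definition
n = np.arange(4)
F4 = np.exp(-2j * np.pi * np.outer(n, n) / 4)

# Tensoring with I_25 applies the same factors to every block of four;
# note that I_25 tensor (I_2 tensor F_2) is I_50 tensor F_2.
lhs = np.kron(np.eye(25), F4)
rhs = np.kron(np.eye(25), F4_factored)
```

Each factor touches only neighbouring entries within a block, which is what makes the product suitable for implementation with local templates and short permutations.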

Table A.1. Sequence of Local Transformations Used to Implement
100 x 100 DFT Using Pure FFT-Based Algorithm.

Matrix                            Type   # of Parallel Steps   Operations/Pixel/Step

 1.)  P6                          P      82
 2.)  P5                          P      17
 3.)  I20 ⊗ V4^t                  A
 4.)  I20 ⊗ V3^t                  A
 5.)  I20 ⊗ V2^t                  A
 6.)  I20 ⊗ V^t                   A
 7.)  P4                          P
 8.)  I20 ⊗ (I2 ⊗ F2)^(1)         A
 9.)  I20 ⊗ P(2,2)^(1)            P
10.)  I20 ⊗ D(2,2)^*(1)           M
11.)  I20 ⊗ (I2 ⊗ F2)^(1)         A
12.)  I20 ⊗ P(2,2)^(1)            P
13.)  I20 ⊗ Λ5                    M
14.)  I20 ⊗ P(2,2)^(1)            P
15.)  I20 ⊗ (I2 ⊗ F2)^(1)         A
16.)  I20 ⊗ D(2,2)^(1)            M
17.)  I20 ⊗ P(2,2)^(1)            P
18.)  I20 ⊗ (I2 ⊗ F2)^(1)         A
19.)  P3                          P
20.)  I20 ⊗ V                     A
21.)  I20 ⊗ V2                    A
22.)  I20 ⊗ V3                    A
23.)  I20 ⊗ V4                    A
24.)  I4 ⊗ D(5,5)                 M
      Repeat steps 2.) through 23.)
25.)  P2                          P      100
26.)  I50 ⊗ F2                    A
27.)  I25 ⊗ D(2,2)                M
28.)  I25 ⊗ P(2,2)                P
29.)  I50 ⊗ F2                    A
30.)  P1                          P      100

M stands for multiplication, A for addition, and P for permutation.

Note that it would take, at most, 690 parallel permutation steps, 18 parallel
multiplication steps, and 52 parallel addition steps if this program were implemented on
a parallel machine. Since they require a significant number of floating point operations,
some of the images used by the program to perform the multiplications, specifically the
images corresponding to the matrices $I_{20} \otimes \Lambda_5$ and
$I_4 \otimes D(5,5)$, are generated and stored in files to be read as input to the
program dft100dp.f. The other multiplication images are defined internal to the program
since they can be defined without floating point operations. The permutations P1, P2, P3,
P4, P5, P6 of Table A.1 are also pre-generated and stored in files. They are implemented
using a general purpose subroutine. Due to the simplicity of their structure, the other
permutations are implemented directly. Many of the template definitions involve a call to
the mod function. Thus, two tables of residues (mod 2 and mod 5) are generated at the
beginning of the program in order to save some time. The program dft100dp.f85 executes
the following steps:

1.) Read in input image, permutation tables, and multiplication images.

2.) Construct multiplication images.

3.) Execute the sequence of local transformations given in Table A.1 twice, first along
the rows and then along the columns of the result. The same templates are used for
both passes; the multiplication images are transposed in between.

4.) Output the DFT of the input image.
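Step 3 relies on the fact that a two-dimensional DFT separates into one-dimensional DFTs along the rows followed by one-dimensional DFTs along the columns of the result. The sketch below illustrates only that separability. It is not the Table A.1 sequence: NumPy's library FFT stands in for the one-dimensional transform, and the function name is hypothetical.

```python
import numpy as np

def dft_rows_then_columns(image):
    """2-D DFT as a row pass followed by a column pass."""
    rows_done = np.fft.fft(image, axis=1)    # 1-D DFT along each row
    return np.fft.fft(rows_done, axis=0)     # then along each column of the result

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
B = dft_rows_then_columns(A)
```

The two passes use the same one-dimensional transform, which is why the listing can reuse its templates with the parameter k switching between rows and columns.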
To save time, the transposition is not performed locally. Jesshope has considered
methods for transposing images on mesh-connected arrays [21]. It may be easier, however,
to rebroadcast the information to the individual processors. The main program follows:

C  Program dft100dp.f85

      common m2, m5
      integer m2(0:99), m5(0:99)
      complex variant bfly,ebfly,v2,v2t,v3,v3t,v4,v4t,v,vt,trans
      complex A(100,100),B(100,100)
      complex D2CT(100,100),D22(100,100),D221(100,100)
      complex D55(100,100),L5A(100,100)
      integer p6a(100),p5a(100),p4a(100),p3a(100),p2a(100),p1a(100)
      call makemods()

C  Reading in image
      open(unit=15,file='testim',status='old')
      do 1,i=1,100
    1 read(15,200)(A(j,i),j=1,100)
  200 format(200(F8.3,1X))
      close(15)

C  reading in permutation tables
      open(unit=15,file='P1',status='old')
      read(15,*)(p1a(j),j=1,100)
      close(15)
      open(unit=15,file='P2',status='old')
      read(15,*)(p2a(j),j=1,100)
      close(15)
      open(unit=15,file='P3',status='old')
      read(15,*)(p3a(j),j=1,100)
      close(15)
      open(unit=15,file='P4',status='old')
      read(15,*)(p4a(j),j=1,100)
      close(15)
      open(unit=15,file='P5',status='old')
      read(15,*)(p5a(j),j=1,100)
      close(15)
      open(unit=15,file='P6',status='old')
      read(15,*)(p6a(j),j=1,100)
      close(15)

C  reading in multiplication image lambda5
      open(unit=15,file='L5dp',status='old')
      do 7 j=1,100
        read(15,*) L5A(1,j)
        do 8 n=1,100
    8   L5A(n,j)=L5A(1,j)
    7 continue
      close(15)

C  reading in twiddle factor I4 tensor D(5,5)
      open(unit=15,file='twiddledp',status='old')
      do 15 j=1,100
        read(15,*) D55(1,j)
        do 16 n=1,100
   16   D55(n,j)=D55(1,j)
   15 continue
      close(15)

C  defining the twiddle factors
      D22=cmplx(1)
      jbar=4
      do 9 j=1,25
        do 10 n=1,100
   10   D22(n,jbar)=(0,-1)
        jbar=jbar+4
    9 continue
      D221=cmplx(1)
      jbar=5
      do 11 j=1,20
        do 12 n=1,100
   12   D221(n,jbar)=(0,-1)
        jbar=jbar+5
   11 continue
      D2CT=cmplx(1)
      jbar=5
      do 13 j=1,20
        do 14 n=1,100
   14   D2CT(n,jbar)=(0,1)
        jbar=jbar+5
   13 continue

C  The computations begin here
      do 100 k=1,2
      call perm(A,B,p6a,k)
      L=1
  101 if ( L .lt. 3 ) then
      call perm(B,A,p5a,k)
C  The computation of the 5-point transforms begins here.
      B=A+v4t(k)
      A=B+v3t(k)
      B=A+v2t(k)
      A=B+vt(k)
      call perm(A,B,p4a,k)
      A=B+ebfly(k)
      call p221(A,B,k)
      A=B*D2CT
      B=A+ebfly(k)
      call p221(B,A,k)
      B=A*L5A
      call p221(B,A,k)
      B=A+ebfly(k)
      A=B*D221
      call p221(A,B,k)
      A=B+ebfly(k)
      call perm(A,B,p3a,k)
      A=B+v(k)
      B=A+v2(k)
      A=B+v3(k)
      B=A+v4(k)
C  The computation of the 5-point transforms ends here.
      if ( L .eq. 2 ) go to 102
      A=B*D55
      B=A
      L=L+1
      go to 101
      endif
  102 continue
      call perm(B,A,p2a,k)
C  The computation of the 4-point transforms begins here.
      B=A+bfly(k)
      A=B*D22*0.01
      call p22(A,B,k)
      A=B+bfly(k)
C  The computation of the 4-point transforms ends here.
      call perm(A,B,p1a,k)
C  The multiplication images are transposed here.
      if (k .eq. 1) then
        A=B
        B=D2CT+trans
        D2CT=B
        B=L5A+trans
        L5A=B
        B=D221+trans
        D221=B
        B=D55+trans
        D55=B
        B=D22+trans
        D22=B
      endif
  100 continue
C  The computations end here.
      open(unit=15,file='alg.out',status='new')
      do 400 j=1,100
        write(15,*) B(j,1)
  400 continue
      close(15)
      end

The templates and subroutines used by dft100dp.f85 are the following:


      complex variant template bfly(nrc)
      common m2, m5
      integer m2(0:99), m5(0:99)
C  This is the template corresponding to I50 tensor F2.
      if (nrc .eq. 1) then
        ybar=m2(y-1)
        if (ybar .eq. 0) then
          bfly(x,y)=cmplx(1)
          bfly(x,y+1)=cmplx(1)
        else
          bfly(x,y)=cmplx(-1)
          bfly(x,y-1)=cmplx(1)
        endif
      else
        xbar=m2(x-1)
        if (xbar .eq. 0) then
          bfly(x,y)=cmplx(1)
          bfly(x+1,y)=cmplx(1)
        else
          bfly(x,y)=cmplx(-1)
          bfly(x-1,y)=cmplx(1)
        endif
      endif
      end

      complex variant template ebfly(nrc)

      common m2, m5
      integer m2(0:99), m5(0:99)
C  This is the template corresponding to I20 tensor (I2 tensor F2)^(1).
      if (nrc .eq. 1) then
        ybar=m5(y-1)
        if (ybar .eq. 0) then
          ebfly(x,y)=cmplx(1)
        else
          if (ybar .eq. 1 .or. ybar .eq. 3) then
            ebfly(x,y)=cmplx(1)
            ebfly(x,y+1)=cmplx(1)
          else
            if (ybar .eq. 2 .or. ybar .eq. 4) then
              ebfly(x,y)=cmplx(-1)
              ebfly(x,y-1)=cmplx(1)
            endif
          endif
        endif
      else
        xbar=m5(x-1)
        if (xbar .eq. 0) then
          ebfly(x,y)=cmplx(1)
        else
          if (xbar .eq. 1 .or. xbar .eq. 3) then
            ebfly(x,y)=cmplx(1)
            ebfly(x+1,y)=cmplx(1)
          else
            if (xbar .eq. 2 .or. xbar .eq. 4) then
              ebfly(x,y)=cmplx(-1)
              ebfly(x-1,y)=cmplx(1)
            endif
          endif
        endif
      endif
      end

      complex variant template v2(nrc)
      common m2, m5
      integer m2(0:99), m5(0:99)
      if (nrc .eq. 1) then
        ybar=m5(y-1)
        if (ybar .eq. 2) then
          v2(x,y)=cmplx(1)
          v2(x,y-1)=cmplx(1)
        else
          v2(x,y)=cmplx(1)
        endif
      else
        xbar=m5(x-1)
        if (xbar .eq. 2) then
          v2(x,y)=cmplx(1)
          v2(x-1,y)=cmplx(1)
        else
          v2(x,y)=cmplx(1)
        endif
      endif
      end

      complex variant template v2t(nrc)
      common m2, m5
      integer m2(0:99), m5(0:99)
      if (nrc .eq. 1) then
        ybar=m5(y-1)
        if (ybar .eq. 1) then
          v2t(x,y)=cmplx(1)
          v2t(x,y+1)=cmplx(1)
        else
          v2t(x,y)=cmplx(1)
        endif
      else
        xbar=m5(x-1)
        if (xbar .eq. 1) then
          v2t(x,y)=cmplx(1)
          v2t(x+1,y)=cmplx(1)
        else
          v2t(x,y)=cmplx(1)
        endif
      endif
      end

      complex variant template v3(nrc)
      common m2, m5
      integer m2(0:99), m5(0:99)
      if (nrc .eq. 1) then
        ybar=m5(y-1)
        if (ybar .eq. 3) then
          v3(x,y)=cmplx(1)
          v3(x,y-1)=cmplx(1)
        else
          v3(x,y)=cmplx(1)
        endif
      else
        xbar=m5(x-1)
        if (xbar .eq. 3) then
          v3(x,y)=cmplx(1)
          v3(x-1,y)=cmplx(1)
        else
          v3(x,y)=cmplx(1)
        endif
      endif
      end

      complex variant template v3t(nrc)
      common m2, m5
      integer m2(0:99), m5(0:99)
      if (nrc .eq. 1) then
        ybar=m5(y-1)
        if (ybar .eq. 2) then
          v3t(x,y)=cmplx(1)
          v3t(x,y+1)=cmplx(1)
        else
          v3t(x,y)=cmplx(1)
        endif
      else
        xbar=m5(x-1)
        if (xbar .eq. 2) then
          v3t(x,y)=cmplx(1)
          v3t(x+1,y)=cmplx(1)
        else
          v3t(x,y)=cmplx(1)
        endif
      endif
      end

      complex variant template v4(nrc)
      common m2, m5
      integer m2(0:99), m5(0:99)
      if (nrc .eq. 1) then
        ybar=m5(y-1)
        if (ybar .eq. 4) then
          v4(x,y)=cmplx(1)
          v4(x,y-1)=cmplx(1)
        else
          v4(x,y)=cmplx(1)
        endif
      else
        xbar=m5(x-1)
        if (xbar .eq. 4) then
          v4(x,y)=cmplx(1)
          v4(x-1,y)=cmplx(1)
        else
          v4(x,y)=cmplx(1)
        endif
      endif
      end

      complex variant template v4t(nrc)
      common m2, m5
      integer m2(0:99), m5(0:99)
      if (nrc .eq. 1) then
        ybar=m5(y-1)
        if (ybar .eq. 3) then
          v4t(x,y)=cmplx(1)
          v4t(x,y+1)=cmplx(1)
        else
          v4t(x,y)=cmplx(1)
        endif
      else
        xbar=m5(x-1)
        if (xbar .eq. 3) then
          v4t(x,y)=cmplx(1)
          v4t(x+1,y)=cmplx(1)
        else
          v4t(x,y)=cmplx(1)
        endif
      endif
      end

      complex variant template v(nrc)
      common m2, m5
      integer m2(0:99), m5(0:99)
      if (nrc .eq. 1) then
        ybar=m5(y-1)
        if (ybar .eq. 0) then
          v(x,y)=cmplx(1)
        else
          if (ybar .eq. 1) then
            v(x,y)=cmplx(1)
            v(x,y-1)=cmplx(1)
          else
            v(x,y)=cmplx(1)
            v(x,y-1)=cmplx(-1)
          endif
        endif
      else
        xbar=m5(x-1)
        if (xbar .eq. 0) then
          v(x,y)=cmplx(1)
        else
          if (xbar .eq. 1) then
            v(x,y)=cmplx(1)
            v(x-1,y)=cmplx(1)
          else
            v(x,y)=cmplx(1)
            v(x-1,y)=cmplx(-1)
          endif
        endif
      endif
      end

      complex variant template vt(nrc)

      common m2, m5
      integer m2(0:99), m5(0:99)
      if (nrc .eq. 1) then
        ybar=m5(y-1)
        if (ybar .eq. 4) then
          vt(x,y)=cmplx(1)
        else
          if (ybar .eq. 0) then
            vt(x,y)=cmplx(1)
            vt(x,y+1)=cmplx(1)
          else
            vt(x,y)=cmplx(1)
            vt(x,y+1)=cmplx(-1)
          endif
        endif
      else
        xbar=m5(x-1)
        if (xbar .eq. 4) then
          vt(x,y)=cmplx(1)
        else
          if (xbar .eq. 0) then
            vt(x,y)=cmplx(1)
            vt(x+1,y)=cmplx(1)
          else
            vt(x,y)=cmplx(1)
            vt(x+1,y)=cmplx(-1)
          endif
        endif
      endif
      end

      subroutine makemods()

      common m2, m5
      integer m2(0:99), m5(0:99)
C  This is the subroutine to generate the mod tables.
      do 10 i = 0, 99
        m2(i) = mod(i,2)
        m5(i) = mod(i,5)
   10 continue
      end


      subroutine perm(A,B,P,rc)
      complex A(100,100),B(100,100)
      integer P(100),rc
C  This is the general purpose permutation subroutine.
C  A contains the image to be permuted.
C  B will contain the output image.
C  P contains the permutation to be implemented.
C  rc=1 indicates permute along rows, rc=2 along columns.
      do 1 i=1,100
      do 1 j=1,100
        if ( rc .eq. 1 ) then
          B(i,j) = A(i,P(j))
        else
          B(i,j) = A(P(i),j)
        endif
    1 continue
      end

      subroutine p221(A,B,nrc)
      complex A(100,100),B(100,100)
C  This subroutine implements the permutation I20 tensor P(2,2)^(1).
C  A contains the image to be permuted.
C  B will contain the output image.
C  nrc=1 indicates permute along rows, nrc=2 along columns.
      jbar=3
      B=A
      do 1 j=1,20
        if (nrc .eq. 1) then
          do 2 n=1,100
            B(n,jbar)=A(n,jbar+1)
            B(n,jbar+1)=A(n,jbar)
    2     continue
        else
          do 3 n=1,100
            B(jbar,n)=A(jbar+1,n)
            B(jbar+1,n)=A(jbar,n)
    3     continue
        endif
        jbar=jbar+5
    1 continue
      end

      subroutine p22(A,B,nrc)
      complex A(100,100),B(100,100)
C  This subroutine implements the permutation I25 tensor P(2,2).
C  A contains the image to be permuted.
C  B will contain the output image.
C  nrc=1 indicates permute along rows, nrc=2 along columns.
      jbar=2
      B=A
      do 1,j=1,25
        if (nrc .eq. 1) then
          do 2 n=1,100
            B(n,jbar)=A(n,jbar+1)
            B(n,jbar+1)=A(n,jbar)
    2     continue
        else
          do 3 n=1,100
            B(jbar,n)=A(jbar+1,n)
            B(jbar+1,n)=A(jbar,n)
    3     continue
        endif
        jbar=jbar+4
    1 continue
      end
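The two styles of permutation in the listings, table-driven (perm) and direct (p22, p221), can be compared in a short sketch. The Python below is illustrative only and uses NumPy arrays with 0-based indexing. The functions `perm` and `p22` mirror the subroutines of the same names, while `p22_table` is a hypothetical helper that builds the 1-based index table an equivalent call to perm would need.

```python
import numpy as np

def perm(A, P, rc):
    """Table-driven permutation; P holds 1-based source indices as in the listing."""
    idx = np.asarray(P) - 1
    return A[:, idx] if rc == 1 else A[idx, :]

def p22(A, nrc):
    """Direct form of I25 tensor P(2,2): swap entries 2 and 3 of each block of 4."""
    B = A.copy()
    for jbar in range(1, 100, 4):            # 0-based position of entry 2 in each block
        if nrc == 1:
            B[:, [jbar, jbar + 1]] = A[:, [jbar + 1, jbar]]
        else:
            B[[jbar, jbar + 1], :] = A[[jbar + 1, jbar], :]
    return B

def p22_table():
    """1-based table that makes perm() reproduce p22()."""
    table = []
    for j in range(1, 101):
        pos = (j - 1) % 4                    # position inside the block of 4
        if pos == 1:
            table.append(j + 1)
        elif pos == 2:
            table.append(j - 1)
        else:
            table.append(j)
    return table

A = np.arange(10000.0).reshape(100, 100)
same_rows = np.array_equal(perm(A, p22_table(), 1), p22(A, 1))
same_cols = np.array_equal(perm(A, p22_table(), 2), p22(A, 2))
```

The direct form touches only neighbouring entries, which is the property the text relies on when it counts such a permutation as a single parallel step.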

      complex variant template trans
C  This template will transpose an image.
      trans(y,x)=cmplx(1)
      end

The next program, oem100.f85, is the same as dft100dp.f85 except that instead of using
the Rader prime algorithm and the convolution theorem to compute the 5-point transforms,
oblique elimination is used. Denote the minimal variable decomposition of $F_5$ by
$F_5 = P_5 M_8 M_7 \cdots M_1$. Then the program is an implementation of the sequence of
transformations shown in Table A.2.


Table A.2. Sequence of Local Transformations Used to Implement
100 x 100 DFT Using Hybrid Algorithm.

Matrix                  Type   # of Parallel Steps   Operations/Pixel/Step

 1.)  P6                P      82
 2.)  P5                P      17
 3.)  M1                A/M    1                     2
 4.)  M2                A/M    1                     2
 5.)  M3                A/M    1                     2
 6.)  M4                A/M    1                     3
 7.)  M5                A/M    1                     2
 8.)  M6                A/M    1                     2
 9.)  M7                A/M    1                     2
10.)  M8                A/M    1                     2
11)   P5                P      5
11.)  I4 ⊗ D(5,5)       M      1
      Repeat steps 2.) through 10.)
12.)  P2                P      100
13.)  I50 ⊗ F2          A      1
14.)  I25 ⊗ D(2,2)      M      1
15.)  I25 ⊗ P(2,2)      P      1
16.)  I50 ⊗ F2          A      1
17.)  P1                P      100


M  stands  for  multiplication,  A  for  addition,  and  P  for  permutation. 
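The per-step counts in Table A.2 can be totalled with a few lines of arithmetic. The sketch below is an illustration under two stated assumptions: the repeated block runs from step 2.) through the 5-step pivot permutation that follows M8, and the whole sequence is executed twice (once along rows, once along columns). An A/M step is counted once as an addition and once as a multiplication. All names are hypothetical.

```python
# (matrix, type, parallel steps) for the 5-point transform block
five_point = (
    [("P5", "P", 17)]
    + [(f"M{i}", "A/M", 1) for i in range(1, 9)]
    + [("pivot P5", "P", 5)]
)

# one pass of Table A.2 (rows or columns)
one_pass = (
    [("P6", "P", 82)]
    + five_point
    + [("I4 x D(5,5)", "M", 1)]
    + five_point
    + [
        ("P2", "P", 100),
        ("I50 x F2", "A", 1),
        ("I25 x D(2,2)", "M", 1),
        ("I25 x P(2,2)", "P", 1),
        ("I50 x F2", "A", 1),
        ("P1", "P", 100),
    ]
)

def total(kind, passes=2):
    """Total parallel steps of one kind ('P', 'M' or 'A') over all passes."""
    return passes * sum(steps for _, typ, steps in one_pass if kind in typ.split("/"))

totals = {kind: total(kind) for kind in ("P", "M", "A")}
print(totals)
```

Under these assumptions the totals come to 654 permutation steps, 36 multiplication steps, and 36 addition steps.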


Note that this implementation would take, at most, 654 parallel permutation steps,
36 multiplication steps, and 36 addition steps. Although these numbers do not appear to
be as good as for the pure FFT-based decompositions, it is interesting to observe that
the program oem100.f85 runs significantly faster (about 30%) than dft100dp.f85 on the
VAX. A possible reason for this is that the different types of steps are more
interspersed in dft100dp.f85. Thus, there are more instructions necessary; in particular,
there are more templates involved. Since each template involves a subroutine call for
every point in the image every time that it is used, this is a significant point. The
template definitions are also more straightforward. Finally, the preprocessor goes
through the motions of multiplying by the weights in the $\oplus$ operation, even if the
weights are 1's. This seriously affects the performance of these programs since every
template used in dft100dp.f85 has weights 0, 1, or -1. The templates t1,t2,...,t8 in the
program correspond to the matrices $M_1, M_2, \ldots, M_8$, respectively. They were
defined using the coefficients generated by the program coeffgen.f. The main program
follows:

C  Program oem100.f85

      common m2, m5
      integer m2(0:99), m5(0:99)
      complex variant bfly,t1,t2,t3,t4,t5,t6,t7,t8,trans
      complex A(100,100),B(100,100)
      complex D22(100,100),D55(100,100)
      integer p6a(100),p5a(100),p2a(100),p1a(100),oemp(100)
      call makemods()

C  Reading in image
      open(unit=15,file='testim',status='old')
      do 1,i=1,100
    1 read(15,200)(A(i,j),j=1,100)
  200 format(200(F8.3,1X))
      close(15)

C  Reading in permutation tables
      open(unit=15,file='P1',status='old')
      read(15,*)(p1a(j),j=1,100)
      close(15)
      open(unit=15,file='P2',status='old')
      read(15,*)(p2a(j),j=1,100)
      close(15)
      open(unit=15,file='P5',status='old')
      read(15,*)(p5a(j),j=1,100)
      close(15)
      open(unit=15,file='P6',status='old')
      read(15,*)(p6a(j),j=1,100)
      close(15)
      open(unit=15,file='oemf85',status='old')
      read(15,*)(oemp(j),j=1,100)
      close(15)

C  reading in twiddle factor I4 tensor D(5,5)
      open(unit=15,file='twiddledp',status='old')
      do 15 j=1,100
        read(15,*) D55(1,j)
        do 16 n=1,100
   16   D55(n,j)=D55(1,j)
   15 continue
      close(15)

C  defining the twiddle factors
      D22=cmplx(1)
      jbar=4
      do 9 j=1,25
        do 10 n=1,100
   10   D22(n,jbar)=(0,-1)
        jbar=jbar+4
    9 continue

C  The computations begin here
      do 100 k=1,2
      call perm(A,B,p6a,k)
      L=1
  101 if ( L .lt. 3 ) then
      call perm(B,A,p5a,k)
C  The computation of the 5-point transforms begins here.
      B=A+t1(k)
      A=B+t2(k)
      B=A+t3(k)
      A=B+t4(k)
      B=A+t5(k)
      A=B+t6(k)
      B=A+t7(k)
      A=B+t8(k)
      call oemperm(A,oemp,k)
C  The computation of the 5-point transforms ends here.
      if ( L .eq. 2 ) go to 102
      B=A*D55
      L=L+1
      go to 101
      endif
  102 continue
      call perm(A,B,p2a,k)
C  The computation of the 4-point transforms begins here.
C  The permuted image is in B at this point.
      A=B+bfly(k)
      B=A*D22*0.01
      call p22(B,A,k)
      B=A+bfly(k)
C  The computation of the 4-point transforms ends here.
      call perm(B,A,p1a,k)
      if (k .eq. 1) then
        B=D55+trans
        D55=B
        B=D22+trans
        D22=B
      endif
  100 continue
C  The computations end here.  The result is in A.
      open(unit=15,file='alg.out',status='new')
      do 400 j=1,100
        write(15,*) A(1,j)
  400 continue
      close(15)
      end

The templates and subroutines used by oem100.f85 are as follows:

      complex variant template bfly(nrc)
      common m2, m5
      integer m2(0:99), m5(0:99)
C  This is the template corresponding to I50 tensor F2.
      if (nrc .eq. 1) then
        ybar=m2(y-1)
        if (ybar .eq. 0) then
          bfly(x,y)=cmplx(1)
          bfly(x,y+1)=cmplx(1)
        else
          bfly(x,y)=cmplx(-1)
          bfly(x,y-1)=cmplx(1)
        endif
      else
        xbar=m2(x-1)
        if (xbar .eq. 0) then
          bfly(x,y)=cmplx(1)
          bfly(x+1,y)=cmplx(1)
        else
          bfly(x,y)=cmplx(-1)
          bfly(x-1,y)=cmplx(1)
        endif
      endif
      end

      subroutine makemods()
      common m2, m5
      integer m2(0:99), m5(0:99)
C  This is the subroutine to generate the mod tables.
      do 10 i = 0, 99
        m2(i) = mod(i,2)
        m5(i) = mod(i,5)
   10 continue
      end

      subroutine perm(A,B,P,rc)

      complex A(100,100),B(100,100)
      integer P(100),rc
C  This is the general purpose permutation subroutine.
C  A contains the image to be permuted.
C  B will contain the output image.
C  P contains the permutation to be implemented.
C  rc=1 indicates permute along rows, rc=2 along columns.
      do 1 i=1,100
      do 1 j=1,100
        if ( rc .eq. 1 ) then
          B(i,j) = A(i,P(j))
        else
          B(i,j) = A(P(i),j)
        endif
    1 continue
      end

      subroutine p22(A,B,nrc)
      complex A(100,100),B(100,100)
C  This subroutine implements the permutation I25 tensor P(2,2).
C  A contains the image to be permuted.
C  B will contain the output image.
C  nrc=1 indicates permute along rows, nrc=2 along columns.
      jbar=2
      B=A
      do 1,j=1,25
        if (nrc .eq. 1) then
          do 2 n=1,100
            B(n,jbar)=A(n,jbar+1)
            B(n,jbar+1)=A(n,jbar)
    2     continue
        else
          do 3 n=1,100
            B(jbar,n)=A(jbar+1,n)
            B(jbar+1,n)=A(jbar,n)
    3     continue
        endif
        jbar=jbar+4
    1 continue
      end

      complex variant template trans

C  This template will transpose an image.
      trans(y,x)=cmplx(1)
      end

The next eight templates are the templates defined using minimal variable oblique
elimination.

      complex variant template t1(k)
      common m2, m5
      integer m2(0:99), m5(0:99)
      t1(x,y)=(1.0,0.0)
      if (k .eq. 1) then
        yb=m5(y-1)
        if(yb .eq. 3) then
          t1(x,y+1)=(1.0,0.0)
        endif
      else
        xb=m5(x-1)
        if(xb .eq. 3) then
          t1(x+1,y)=(1.0,0.0)
        endif
      endif
      end

      complex variant template t2(k)
      common m2, m5
      integer m2(0:99), m5(0:99)
      t2(x,y)=(1.0,0.0)
      if (k .eq. 1) then
        yb=m5(y-1)
        if(yb .eq. 2) then
          t2(x,y+1)=(1.0,0.0)
        endif
        if(yb .eq. 3) then
          t2(x,y+1)=(-0.809017,-0.587785)
        endif
      else
        xb=m5(x-1)
        if(xb .eq. 2) then
          t2(x+1,y)=(1.0,0.0)
        endif
        if(xb .eq. 3) then
          t2(x+1,y)=(-0.809017,-0.587785)
        endif
      endif
      end

      complex variant template t3(k)
      common m2, m5
      integer m2(0:99), m5(0:99)
      t3(x,y)=(1.0,0.0)
      if (k .eq. 1) then
        yb=m5(y-1)
        if(yb .eq. 1) then
          t3(x,y+1)=(1.0,0.0)
        endif
        if(yb .eq. 2) then
          t3(x,y+1)=(-0.809017,-0.587785)
        endif
        if(yb .eq. 3) then
          t3(x,y+1)=(-0.809017,0.587785)
        endif
      else
        xb=m5(x-1)
        if(xb .eq. 1) then
          t3(x+1,y)=(1.0,0.0)
        endif
        if(xb .eq. 2) then
          t3(x+1,y)=(-0.809017,-0.587785)
        endif
        if(xb .eq. 3) then
          t3(x+1,y)=(-0.809017,0.587785)
        endif
      endif
      end

      complex variant template t4(k)

      common m2, m5
      integer m2(0:99), m5(0:99)
      if (k .eq. 1) then
        yb=m5(y-1)
        if(yb .eq. 0) then
          t4(x,y)=(1.0,0.0)
          t4(x,y+1)=(1.0,0.0)
        endif
        if(yb .eq. 1) then
          t4(x,y)=(-1.80902,-0.587785)
          t4(x,y+1)=(1.11803,1.53884)
        endif
        if(yb .eq. 2) then
          t4(x,y)=(-0.690983,-2.12663)
          t4(x,y+1)=(1.80902,1.31433)
        endif
        if(yb .eq. 3) then
          t4(x,y)=(-2.5,0.8123)
          t4(x,y+1)=(0.0,2.62866)
        endif
        if(yb .eq. 4) then
          t4(x,y)=(1.54508,-4.75528)
        endif
      else
        xb=m5(x-1)
        if(xb .eq. 0) then
          t4(x,y)=(1.0,0.0)
          t4(x+1,y)=(1.0,0.0)
        endif
        if(xb .eq. 1) then
          t4(x,y)=(-1.80902,-0.587785)
          t4(x+1,y)=(1.11803,1.53884)
        endif
        if(xb .eq. 2) then
          t4(x,y)=(-0.690983,-2.12663)
          t4(x+1,y)=(1.80902,1.31433)
        endif
        if(xb .eq. 3) then
          t4(x,y)=(-2.5,0.8123)
          t4(x+1,y)=(0.0,2.62866)
        endif
        if(xb .eq. 4) then
          t4(x,y)=(1.54508,-4.75528)
        endif
      endif
      end

      complex variant template t5(k)

      common m2, m5
      integer m2(0:99), m5(0:99)
      t5(x,y)=(1.0,0.0)
      if (k .eq. 1) then
        yb=m5(y-1)
        if(yb .eq. 1) then
          t5(x,y-1)=(1.0,0.0)
        endif
        if(yb .eq. 2) then
          t5(x,y-1)=(-0.190983,-0.587785)
        endif
        if(yb .eq. 3) then
          t5(x,y-1)=(0.809017,0.587785)
        endif
        if(yb .eq. 4) then
          t5(x,y-1)=(1.61803,0.0)
        endif
      else
        xb=m5(x-1)
        if(xb .eq. 1) then
          t5(x-1,y)=(1.0,0.0)
        endif
        if(xb .eq. 2) then
          t5(x-1,y)=(-0.190983,-0.587785)
        endif
        if(xb .eq. 3) then
          t5(x-1,y)=(0.809017,0.587785)
        endif
        if(xb .eq. 4) then
          t5(x-1,y)=(1.61803,0.0)
        endif
      endif
      end

      complex variant template t6(k)

      common m2, m5
      integer m2(0:99), m5(0:99)
      t6(x,y)=(1.0,0.0)
      if (k .eq. 1) then
        yb=m5(y-1)
        if(yb .eq. 2) then
          t6(x,y-1)=(1.0,0.0)
        endif
        if(yb .eq. 3) then
          t6(x,y-1)=(-1.30902,-0.951057)
        endif
        if(yb .eq. 4) then
          t6(x,y-1)=(-1.0,0.0)
        endif
      else
        xb=m5(x-1)
        if(xb .eq. 2) then
          t6(x-1,y)=(1.0,0.0)
        endif
        if(xb .eq. 3) then
          t6(x-1,y)=(-1.30902,-0.951057)
        endif
        if(xb .eq. 4) then
          t6(x-1,y)=(-1.0,0.0)
        endif
      endif
      end

      complex variant template t7(k)

      common m2, m5
      integer m2(0:99), m5(0:99)
      t7(x,y)=(1.0,0.0)
      if (k .eq. 1) then
        yb=m5(y-1)
        if(yb .eq. 3) then
          t7(x,y-1)=(1.0,0.0)
        endif
        if(yb .eq. 4) then
          t7(x,y-1)=(-0.809017,0.587885)
        endif
      else
        xb=m5(x-1)
        if(xb .eq. 3) then
          t7(x-1,y)=(1.0,0.0)
        endif
        if(xb .eq. 4) then
          t7(x-1,y)=(-0.809017,0.587885)
        endif
      endif
      end

      complex variant template t8(k)
      common m2, m5
      integer m2(0:99), m5(0:99)
      t8(x,y)=(1.0,0.0)
      if (k .eq. 1) then
        yb=m5(y-1)
        if(yb .eq. 4) then
          t8(x,y-1)=(1.0,0.0)
        endif
      else
        xb=m5(x-1)
        if(xb .eq. 4) then
          t8(x-1,y)=(1.0,0.0)
        endif
      endif
      end

      subroutine oemperm(A,oemp,k)

      complex A(100,100)
      complex z
      integer oemp(100)
C  The permutations corresponding to the partial pivots are
C  executed here
      if (k .eq. 1) then
        do 1 j=1,100
          do 5 i=100,1,-1
            if (oemp(i) .ne. i) then
              z=A(j,oemp(i))
              A(j,oemp(i))=A(j,i)
              A(j,i)=z
            endif
    5     continue
    1   continue
      else
        do 2 j=1,100
          do 6 i=100,1,-1
            if (oemp(i) .ne. i) then
              z=A(oemp(i),j)
              A(oemp(i),j)=A(i,j)
              A(i,j)=z
            endif
    6     continue
    2   continue
      endif
      end

For completeness, we include the programs used to generate the permutation tables and
the multiplication images.

C  Program lambda5.f
      complex L(100),f(5),w5
      complex f5
      complex c0,c1,c2,c3
C  This is a program which generates a row of the
C  image corresponding to the diagonal matrix
C  I20 tensor Lambda5.
      pi=3.1415926
      twopi=2.0*pi
      xn=5.0
      arg=twopi/xn
      w5=cmplx(cos(arg),-sin(arg))
      c0=w5-cmplx(1)
      c1=(w5**3)-cmplx(1)
      c2=(w5**4)-cmplx(1)
      c3=(w5**2)-cmplx(1)
C  We multiply by 0.25 because it needs to be done in order
C  to compute the forward and inverse 4-pt DFT. Since lambda5
C  is used in between it is a convenient place to absorb the
C  multiplication.
      f(1)=cmplx(1)
      f(2)=cmplx(0.25)*f5(cmplx(1),c0,c1,c2,c3)
      f(3)=cmplx(0.25)*f5(cmplx(0,-1),c0,c1,c2,c3)
      f(4)=cmplx(0.25)*f5(cmplx(-1),c0,c1,c2,c3)
      f(5)=cmplx(0.25)*f5(cmplx(0,1),c0,c1,c2,c3)
      do 1 j=1,100
        jbar=mod(j-1,5)+1
        L(j)=f(jbar)
    1 continue
      open(unit=15,file='L5',status='new')
      do 10 j=1,100
   10 write(15,*) L(j)
      close(15)
      end

      complex function f5(x,c0,c1,c2,c3)

      complex x
      complex c0,c1,c2,c3
C  Horner evaluation of c0 + c1*x + c2*x**2 + c3*x**3.
      f5=c0+x*(c1+x*(c2+x*c3))
      end
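The four calls to f5 in lambda5.f evaluate one cubic polynomial at the fourth roots of unity 1, -i, -1, i, which is the same as taking a 4-point DFT of the coefficients (c0, c1, c2, c3). The Python sketch below illustrates that equivalence. It is not part of the original programs, and apart from `f5`, `w5` and the coefficients, which mirror the listing, the names are hypothetical.

```python
import cmath

def f5(x, c0, c1, c2, c3):
    """Horner evaluation of c0 + c1*x + c2*x**2 + c3*x**3, as in the listing."""
    return c0 + x * (c1 + x * (c2 + x * c3))

w5 = cmath.exp(-2j * cmath.pi / 5)
c = [w5 - 1, w5**3 - 1, w5**4 - 1, w5**2 - 1]
roots = [1, -1j, -1, 1j]                      # powers of exp(-2*pi*i/4)

# Polynomial evaluation at the roots, as done in lambda5.f ...
horner = [f5(x, *c) for x in roots]

# ... and the 4-point DFT of the coefficients, computed from its definition
direct = [
    sum(c[k] * cmath.exp(-2j * cmath.pi * m * k / 4) for k in range(4))
    for m in range(4)
]

# One period (5 entries) of the row written to the file L5
f = [1] + [0.25 * h for h in horner]
```

The factor 0.25 is the scaling that the listing's comment says is absorbed into this multiplication image.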

C  Program d55.f
      complex tw(100),w5
      integer r,q
C  This is the program which generates a row of the image
C  corresponding to the diagonal matrix I4 tensor D(5,5).
      pi=3.1415926
      twopi=2.0*pi
      xn=25.0
      arg=twopi/xn
      w5=cmplx(cos(arg),-sin(arg))
      do 1 j=1,100
        jbar=mod(j-1,25)
        r=mod(jbar,5)
        q=(jbar-r)/5
        tw(j)=w5**(r*q)
    1 continue
      open(unit=15,file='twiddle',status='new')
      do 10 j=1,100
   10 write(15,*) tw(j)
      close(15)
      end

C  Program genp1.f

      integer s(100),so(100),r,q,qbar
C  This is the program to generate the permutation table P1
      do 1 j=1,100
        jbar=mod(j-1,4)
        if (jbar .eq. 1) then
          s(j)=j+1
        else
          if (jbar .eq. 2 ) then
            s(j)=j-1
          else
            s(j)=j
          endif
        endif
    1 continue
C  elementary circulant
      do 3 j=1,100
        jbar=j-1
        r=mod(jbar,4)
        q=(jbar-r)/4
        qbar=mod(q,4)
        if ( r + qbar .lt. 4 ) then
          so(j)=s(j+qbar)
        else
          q=4-qbar
          so(j)=s(j-q)
        endif
    3 continue
C  shuffle 25q+r -> q+4r
      do 2 j=1,100
        jbar=j-1
        r=mod(jbar,25)
        q=(jbar-r)/25
        k=q+4*r+1
        s(j)=so(k)
    2 continue
      open(unit=15,file='P1',status='new')
      write(15,*)(s(j),j=1,100)
      close(15)
      end

C  Program genp2.f

integer  s(100),so(100),r,q 

C  This  is  the  program  to  generate  the  permutation  table  P2 

C  125  tensor  P(5,5)  5q+r  ->q+5r 

do  1  j=l,25 
jbar  =j-l 
r=mod(jbar,5) 
q=(jbar-r)/5 
k=q+5*r+l 
s(j)=k 

s(j+25)=k+25 
s(j+50)=k+50 
s(j+75)=k+75 

1  continue 

C  shuffle  4q+r->  q+25r 

do  2  j=l,100 

jbar=j"! 

r=mod(jbar,4) 

q=(jbar-r)/4 

k=q+25*r+l 

so(j)=s(k) 

2  continue 


do  3  j=l,100 
jbar=mod(j-l,4) 
if  (jbar  .eq.  1)  then 
s(J)=so(j+l) 
else 

if  (jbar  .eq.  2  )  then 
s(j)=so(j-l) 
else 

s(j)=so(j) 
endif 
endif 
3  continue 

open(unit=15,file='P2',status='new') 
write(15,*)(s(j),j=l,100) 
close(15) 

end 
C  Program genp3.f

      integer s(100),so(100),r

C  This is the program used to generate the permutation table P3

C  swap entries 2 and 3 (counting from 0) of every group of five
      do 2 j=1,100
         r=mod(j-1,5)
         if (r .eq. 2) then
            s(j)=j+1
         else
            if (r .eq. 3) then
               s(j)=j-1
            else
               s(j)=j
            endif
         endif
2     continue

C  reorder each group of five: positions 0,1,2,3,4 take 0,4,1,3,2
      do 1 j=1,100
         r=mod(j-1,5)
         if (r .eq. 0) so(j)=s(j)
         if (r .eq. 1) so(j)=s(j+3)
         if (r .eq. 2) so(j)=s(j-1)
         if (r .eq. 3) so(j)=s(j)
         if (r .eq. 4) so(j)=s(j-2)
1     continue

      print *,(so(j),j=1,100)

      open(unit=15,file='P3',status='new')
      write(15,*)(so(j),j=1,100)
      close(15)

      end


************************************************************** 
C  Program genp4.f

      integer s(100),so(100),r

C  This is the program to generate the permutation table P4

C  reorder each group of five: positions 0,1,2,3,4 take 0,3,4,2,1
      do 1 j=1,100
         r=mod(j-1,5)
         if (r .eq. 0) s(j)=j
         if (r .eq. 1) s(j)=j+2
         if (r .eq. 2) s(j)=j+2
         if (r .eq. 3) s(j)=j-1
         if (r .eq. 4) s(j)=j-3
1     continue

C  swap entries 2 and 3 (counting from 0) of every group of five
      do 2 j=1,100
         r=mod(j-1,5)
         if (r .eq. 2) then
            so(j)=s(j+1)
         else
            if (r .eq. 3) then
               so(j)=s(j-1)
            else
               so(j)=s(j)
            endif
         endif
2     continue

      open(unit=15,file='P4',status='new')
      write(15,*)(so(j),j=1,100)

      end

C  Program genp5.f

      integer s(100),r,q

C  This is the program to generate the permutation table P5

C  125 tensor P(5,5)  5q+r -> q+5r
C  (four consecutive 25-entry blocks, one copy of P(5,5) each)
      do 1 j=1,25
         jbar=j-1
         r=mod(jbar,5)
         q=(jbar-r)/5
         k=q+5*r+1
         s(j)=k
         s(j+25)=k+25
         s(j+50)=k+50
         s(j+75)=k+75
1     continue

      open(unit=15,file='P5',status='new')
      write(15,*)(s(j),j=1,100)

      end
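
The map 5q+r -> q+5r used in genp5.f exchanges the roles of q and r, so applying the table twice should return every index to itself. The following is a minimal sketch of that check, not one of the original listings: it rebuilds the table in memory with the same loop as genp5.f and tests s(s(j)) = j. The logical variable ok is an assumption introduced here for illustration.

```fortran
C  Sketch (hypothetical check, not an original listing):
C  rebuilds the genp5.f table and tests that s(s(j)) = j.
      integer s(100),r,q
      logical ok
      do 1 j=1,25
         r=mod(j-1,5)
         q=(j-1-r)/5
         k=q+5*r+1
         s(j)=k
         s(j+25)=k+25
         s(j+50)=k+50
         s(j+75)=k+75
1     continue
      ok=.true.
      do 2 j=1,100
         if (s(s(j)) .ne. j) ok=.false.
2     continue
      print *,'s(s(j)) = j for all j: ',ok
      end
```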

C  Program genp6.f

      integer s(100),so(100),r,q

C  This is the program to generate the permutation table P6

C  Shuffle perm 25q+r -> q+4r
      do 1 j=1,100
         r=mod(j-1,25)
         q=(j-1-r)/25
         k=q+4*r+1
         s(j)=k
1     continue

C  The first block of 25 is copied unchanged; block b (b = 1,2,3)
C  is shifted cyclically by 6*b positions within the block.
      do 2 j=1,100
         if (j .le. 25) then
            so(j)=s(j)
         else
            jbar=j-1
            r=mod(jbar,25)
            q=((jbar-r)/25)*6
            if (r+q .lt. 25) then
               so(j)=s(j+q)
            else
               q=25-q
               so(j)=s(j-q)
            endif
         endif
2     continue

      open(unit=15,file='P6',status='new')
      write(15,*)(so(j),j=1,100)
      close(15)

      end
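
Each table written by genp1.f through genp6.f is intended to be a permutation of the indices 1 to 100. The following is a minimal sketch of a checker, not one of the original listings: it reads one table back with list-directed input and counts how often each index occurs. The program name chkperm.f, the array seen and the choice of the file 'P6' are assumptions made for illustration; any of the other table files could be substituted.

```fortran
C  Sketch: chkperm.f (hypothetical name, not an original listing)
C  Reads a 100-entry table and tests that every index in 1..100
C  occurs exactly once.
      integer s(100),seen(100)
      logical ok
      open(unit=15,file='P6',status='old')
      read(15,*)(s(j),j=1,100)
      close(15)
      do 1 j=1,100
         seen(j)=0
1     continue
      ok=.true.
      do 2 j=1,100
         if (s(j) .lt. 1 .or. s(j) .gt. 100) then
            ok=.false.
         else
            seen(s(j))=seen(s(j))+1
         endif
2     continue
      do 3 j=1,100
         if (seen(j) .ne. 1) ok=.false.
3     continue
      if (ok) then
         print *,'table is a permutation of 1..100'
      else
         print *,'table is NOT a permutation of 1..100'
      endif
      end
```

The range test comes before the count so that a corrupted entry cannot index outside the array seen.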
This concludes the presentation of programs used to implement some of the decompositions developed in this dissertation. The programs illustrate how oblique elimination can be used in conjunction with FFT-based methods to develop good local algorithms for computing DFTs. They also illustrate that a program for computing DFTs of any blocksize (up to some upper bound) locally can be written with little effort. The latter programs also illustrate how the image algebra can be used as a basis for an algebraically based language that supports highly parallel operations.